Ralf Hemmecke wrote:
>>> I've implemented (A) (a counter that tells me when every position
>>> had been checked).
>>
>> I am afraid that this is not a good solution.
>
> As I said, that is only to ensure termination of the search.
>
>> Namely, before you get every position taken by an entry or a deleted
>> marker you will have a period of time when the effective load factor
>> for inserting new entries is close to one.
>
> Not necessarily. (Well, it depends what you mean by "effective load
> factor".)
For searches, deleted entries behave like occupied entries. Assuming that deleted entries are randomly mixed with occupied entries, the expected time to look up an entry in a table of size N with n occupied entries and m deleted entries is the same as the time to look up an entry in a table having m + n occupied entries. In other words, for searches the table behaves like a table without deletion with load factor (m + n)/N. For failing searches (in particular, inserting a new entry) the same holds.

My point is that to get a table of size N full of deleted entries you need to do at least N insertions. More precisely, for each k < N you need to do an insertion into a table having k occupied and deleted entries (of course, if some insertions go into deleted slots you may do more insertions, but that is a lower bound). The expected search length for an insertion is as for a table with load factor k/N, that is (assuming a random distribution of keys), in the simplest model, N/(N - k). Summing that for k = 1 to N - 1 gives a value of order N*log(N), so the average over this sequence is of order log(N).

Of course, in real life the sequence of insertions and deletions is non-random, and the hash function is non-random too. However, non-randomness of the hash function normally makes things only worse. In principle you can do a large number of insertions and deletions keeping the number of occupied and deleted entries at m (if a large number of insertions happen to find deleted slots to go into); however, as long as m and the table size are independent, this does not change the estimate for the average (just average over m), but it may lead to quite bad worst-case behaviour if m = N - 1. You may get better behaviour if most searches go to a small number of entries that were inserted at the start and consequently have small search lengths. That is the good case, but the bad case seems at least as likely (if not more likely) than this one.

>> Crude estimate of runtime indicates that average time to insert an
>> entry will be of order of logarithm of table size.
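To make the effect of deleted markers concrete, here is a small Python simulation (illustration only, not FriCAS code; the table size, key counts, and the hash functions `h1`/`h2` are arbitrary choices for the demo). It builds an open-addressing table with double hashing, interleaves insertions and deletions so that tombstones accumulate, and compares the measured cost of failing searches against the simple-model prediction 1/(1 - (occupied + deleted)/N):

```python
import random

# Toy open-addressing table with double hashing and explicit DELETED
# markers.  Demonstrates the claim above: failing searches in a table
# with n occupied and m deleted entries cost about as much as in a
# table with n + m occupied entries, i.e. the effective load factor
# is (occupied + deleted)/N.

N = 1 << 12                     # table size, a power of two
EMPTY, DELETED = None, "<deleted>"

def h1(key):
    return key % N

def h2(key):
    # Second hash, forced odd so that the probe sequence visits every
    # slot when N is a power of two.
    return ((key * 2654435761) >> 8) % N | 1

def insert(table, key):
    h, step = h1(key), h2(key)
    while table[h] not in (EMPTY, DELETED):   # deleted slots are reusable
        h = (h + step) % N
    table[h] = key

def failing_search_probes(table, key):
    """Number of slots inspected by a search for an absent key.
    DELETED slots must be stepped over, exactly like occupied ones."""
    h, step, count = h1(key), h2(key), 1
    while table[h] is not EMPTY:
        count += 1
        h = (h + step) % N
    return count

rng = random.Random(7)
keys = rng.sample(range(1, 1 << 30), 3000)

table = [EMPTY] * N
for k in keys[:2000]:                 # 2000 insertions
    insert(table, k)
for k in keys[1000:2000]:             # delete 1000 of them
    table[table.index(k)] = DELETED   # (linear scan is fine for a demo)
for k in keys[2000:]:                 # 1000 more; some reuse tombstones
    insert(table, k)

occupied = sum(1 for s in table if s not in (EMPTY, DELETED))
deleted = table.count(DELETED)

probe_rng = random.Random(11)
trials = 3000
measured = sum(failing_search_probes(table,
                                     probe_rng.randrange(1 << 30, 1 << 31))
               for _ in range(trials)) / trials

# Simple-model prediction 1/(1 - effective load factor):
predicted = N / (N - occupied - deleted)
print(occupied, deleted)
print(round(measured, 2), round(predicted, 2))   # close to each other
```

The measured average tracks the effective load factor (occupied + deleted)/N rather than the live load factor occupied/N, which is exactly the point made above.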
> Yep, that should be the case, but with double hashing one can get
> surprising results. I think it was on a test with 1000 entries (1..1000)
> where there was one entry with probing length 22, but none with probing
> length 21 or 20 or 19. (no deletion)

I am writing about the expectation (assuming uniform distribution).

>> IMHO we should rehash if the effective load factor for insertions
>> (that is, the ratio of the number of occupied and deleted entries to
>> the table size) exceeds some fixed threshold (say 0.8 or 0.9). If the
>> load factor is small enough (say < 0.6) then keep the size, otherwise
>> grow the table.
>
> OK, I'll try to implement that. To make it clear, do you want me to set
> my current load factor from 0.7 to 0.6? In fact, I thought about
> lowering it. The suggestion of 0.75 at
> http://docs.oracle.com/javase/1.4.2/docs/api/java/util/Hashtable.html
> might be good when all entries with the same hash go into a common
> bucket list, but with double hashing the probing might hit entries that
> actually come from a totally different bucket position (i.e. probing
> sequences mix somehow).
> So it's probably better to keep the load factor closer to 1/2.

Actually, I was considering having two thresholds. Ignoring deletion, there is a tradeoff between memory use and search length. Note that with threshold 0.7 and doubling for growth the _expected_ load factor is of order 0.5 (during runtime it varies between 0.35 and 0.7). Also, bigger tables have a lower chance of fitting into caches, which means that a bigger table with theoretically lower search length may actually perform worse than a smaller table with larger search length. So for an insert-only table 0.7 would actually be rather low. When we add deletions there is a second tradeoff: between expected search length and time spent rehashing.
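The two-threshold policy could be sketched as follows (a sketch in Python, not FriCAS code; the names `after_insert`, `REHASH_AT`, and `GROW_AT` are made up for illustration, and the threshold values are the ones suggested in the thread, not anything fixed):

```python
# Two-threshold rehash policy: rehash when the *effective* load factor
# (occupied + deleted)/size exceeds REHASH_AT; when rehashing, grow only
# if the live entries alone already fill the table past GROW_AT,
# otherwise rehash at the same size just to drop the tombstones.

REHASH_AT = 0.8    # effective load factor that triggers a rehash
GROW_AT   = 0.6    # live load factor above which the table doubles

def after_insert(occupied, deleted, size):
    """Decide what to do after an insertion.  Returns the new table
    size, or None when no rehash is needed."""
    if (occupied + deleted) / size <= REHASH_AT:
        return None            # effective load still acceptable
    if occupied / size < GROW_AT:
        return size            # rehash in place: just drop tombstones
    return 2 * size            # genuinely full: double and rehash

print(after_insert(500, 0, 1024))      # None: effective load 0.49
print(after_insert(500, 350, 1024))    # 1024: tombstones dominate
print(after_insert(700, 150, 1024))    # 2048: live load 0.68 >= 0.6
```

The point of the second threshold is visible in the middle case: the table is "full" only because of deleted markers, so rehashing at the same size restores a low effective load factor without spending memory on growth.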
IMHO, with many deletions it is reasonable to assume that the table is in a near-stationary state, in particular that in the future there will be many insertions and deletions while the load factor stays nearly constant. Now, with k occupied entries, rehashing will cost k searches. It pays off if the total decrease in search length over future searches exceeds the cost of the rehash. The factors I gave are a (rather conservative) guess at values which make rehashing a win. Also, the upper factor is intended to keep search length within reasonable bounds, while the lower one is a compromise between memory use and time spent rehashing. Note that for a table with many deletions the search length depends on the effective load factor, so the tradeoff between time and memory use is different than for insert-only tables. So it makes sense to use a different threshold for growth.

Let me stress that the factors I gave should give reasonable behaviour, but experience or better estimates may find better ones.

--
                              Waldek Hebisch
[email protected]

--
You received this message because you are subscribed to the Google Groups
"FriCAS - computer algebra system" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to
[email protected].
For more options, visit this group at
http://groups.google.com/group/fricas-devel?hl=en.
