Ralf Hemmecke wrote:
> 
> >> I've implemented (A) (a counter that tells me when every position 
> >> had been checked).
> 
> > I am afraid that this is not a good solution.
> 
> As I said. That is only to ensure termination of the search.
> 
> > Namely, before you get every position taken by an entry or deleted 
> > marker you will have a period of time when the effective load factor 
> > for inserting new entries is close to one.
> 
> Not necessarily. (Well, it depends, what you mean by "effective load
> factor".)

For searching, deleted entries behave like occupied entries.
Assuming that deleted entries are randomly mixed with
occupied entries, the expected time to look up an entry in a
table of size N with n occupied entries and m deleted
entries will be the same as the time to look up an entry
in a table having m + n occupied entries.  In other words,
for searches the table will behave like a table without
deletion with load factor (m + n)/N.  The same holds for
failing searches (in particular, inserting a new entry).
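To make the point concrete, here is a minimal lookup sketch (illustrative Python, not FriCAS code; the names EMPTY, DELETED, h1, h2 are assumptions).  The probe loop cannot stop at a DELETED slot, because the sought key may lie further along the probe sequence; only a never-used slot ends a failing search early.

```python
EMPTY, DELETED = object(), object()   # sentinel markers for table slots

def lookup(table, key, h1, h2):
    """Double-hashing lookup that skips over DELETED slots."""
    size = len(table)
    start, step = h1(key), h2(key)
    for i in range(size):
        slot = table[(start + i * step) % size]
        if slot is EMPTY:
            return None               # never-used slot: key cannot be here
        if slot is not DELETED and slot[0] == key:
            return slot[1]            # found the (key, value) pair
    return None                       # every slot occupied or deleted
```

Note that a DELETED slot costs a probe exactly like an occupied one, which is why the effective load factor for searches is (m + n)/N.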

My point is that to get a table of size N full of deleted
entries you need to do at least N insertions.  More precisely,
for each k < N you need to do an insertion into a table having
k occupied and deleted entries (of course, if some insertions
go into deleted slots you may do more insertions, but N is a
lower bound).  The expected search length for an insertion is
as for a table with load factor k/N, that is (assuming random
distribution of keys), in the simplest model, N/(N - k).  Summing
that for k = 1 to N - 1 gives a value of order N*log(N), that
is, the average over this sequence is of order log(N).
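The estimate is easy to check numerically.  Under uniform hashing the expected cost of a failing search at load factor k/N is 1/(1 - k/N) = N/(N - k); summing this cost over the N - 1 insertions that fill the table gives a harmonic-like sum of order N*log(N):

```python
import math

N = 10_000
# Expected probes for the insertion into a table with k filled slots,
# summed over k = 1 .. N-1 (this is N times a harmonic number):
total = sum(N / (N - k) for k in range(1, N))
average = total / (N - 1)             # per-insertion average, ~ log(N)
print(round(total), round(average, 2))
```

The printed average sits just above log(N) (the gap is essentially the Euler constant), confirming the order-log(N) claim.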

Of course, in real life the sequence of insertions and deletions
is non-random and the hash function is non-random too.  However,
non-randomness of the hash function normally only makes things
worse.  In principle you can do a large number of insertions
and deletions keeping the number of occupied and deleted
entries at m (if a large number of insertions happen to
find a deleted slot to go into); however, as long as m and
the table size are independent this does not change the estimate
for the average (it is just the average over m), but it may lead
to quite bad worst-case behaviour if m = N - 1.

You may get better behaviour if most searches go to a
small number of entries that are inserted at the start and
consequently have small search length.  That is the good
case, but the bad case seems at least as likely (if not
more likely) than this one.

> > A crude estimate of runtime indicates that the average time to insert
> > an entry will be of the order of the logarithm of the table size.
> 
> Yep, that should be the case, but with double hashing one can get
> surprising results. I think it was in a test with 1000 entries (1..1000)
> where there was one entry with probing length 22, but none with probing
> length 21, 20, or 19 (no deletion).

I am writing about expectation (assuming uniform distribution).
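Such probe-length outliers are easy to observe with a quick simulation (illustrative Python with random hash values, not FriCAS code): insert 1000 keys into a prime-sized table by double hashing and record how many slots each insertion inspects.

```python
import random

def probe_lengths(n_keys, size, seed=42):
    """Insert n_keys by double hashing; return each insertion's probe count."""
    rng = random.Random(seed)
    table = [None] * size
    lengths = []
    for _ in range(n_keys):
        h1 = rng.randrange(size)          # primary hash: starting slot
        h2 = 1 + rng.randrange(size - 1)  # step; nonzero, coprime to prime size
        i = 0
        while table[(h1 + i * h2) % size] is not None:
            i += 1
        table[(h1 + i * h2) % size] = True
        lengths.append(i + 1)
    return lengths

lengths = probe_lengths(1000, 1499)       # 1499 is prime; load factor ~0.67
print(max(lengths))                       # long outliers do occur occasionally
```

With a prime table size every nonzero step visits all slots, so insertion always terminates; the average probe length stays small even when a few insertions probe much longer, which matches the expectation-versus-outlier distinction above.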

> > IMHO we should rehash if the effective load factor for insertions 
> > (that is, the ratio of the number of occupied and deleted entries to
> > the table size) exceeds some fixed threshold (say 0.8 or 0.9).  If the
> > load factor is small enough (say < 0.6) then keep the size, otherwise
> > grow the table.
> 
> OK, I'll try to implement that. To make it clear, do you want me to set
> my current load factor from 0.7 to 0.6. In fact, I thought about
> lowering it. The suggestion of 0.75 at
> http://docs.oracle.com/javase/1.4.2/docs/api/java/util/Hashtable.html
> might be good when all entries with the same hash go into a common
> bucket list, but with double hashing the probing might hit entries that
> actually come from a totally different bucket position (i.e. probing
> sequences mix somehow).
> So it's probably better to keep the load factor closer to 1/2.

Actually I was considering having two thresholds.  Ignoring deletion,
there is a tradeoff between memory use and search length.  Note that
with threshold 0.7 and doubling for growth the _expected_ load factor
is of order 0.5 (during runtime it varies between 0.35 and 0.7).  Also,
bigger tables have a lower chance of fitting into caches, which means
that a bigger table with theoretically lower search length may
actually perform worse than a smaller table with larger search length.
So for an insert-only table 0.7 would actually be rather low.
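The "expected load factor of order 0.5" remark follows from a one-line calculation: doubling at threshold 0.7 drops the load to 0.35, after which it climbs linearly back to 0.7, so an average insertion sees roughly the midpoint.

```python
threshold = 0.7
after_doubling = threshold / 2            # load right after the table doubles
print((after_doubling + threshold) / 2)   # midpoint load seen by an insertion
```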

When we add deletions there is a second tradeoff: between expected
search length and time spent rehashing.  IMHO with many deletions
it is reasonable to assume that the table is in a near-stationary
state, in particular that in the future there will be many insertions
and deletions while the load factor stays nearly constant.  Now,
having k occupied entries, rehashing will cost k searches.  It
pays off if the decrease in search length is more than the search
length of rehashing.  The factors I gave are a (rather conservative)
guess at values which make rehashing a win.  Also, the upper
factor is intended to keep search length within reasonable
bounds, while the lower one is a compromise between memory
use and time spent rehashing.  Note that for a table with
many deletions the search length depends on the effective load
factor, so the tradeoff between time and memory use is different
than for insert-only tables.  So it makes sense to use a
different threshold for growth.  Let me stress that the factors
I gave should give reasonable behaviour, but experience
or better estimates may find better factors.
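The two-threshold policy can be sketched in a few lines (the constant names and exact values below are illustrative guesses in the spirit of the discussion, not fixed choices):

```python
REHASH_ABOVE = 0.85   # effective load factor (occupied + deleted) / size
GROW_BELOW = 0.6      # true load factor below this: rehash at the same size

def plan_rehash(occupied, deleted, size):
    """Return the new table size, or None if no rehash is needed yet."""
    if (occupied + deleted) / size <= REHASH_ABOVE:
        return None            # effective load still acceptable
    if occupied / size < GROW_BELOW:
        return size            # mostly deleted slots: rehash in place
    return 2 * size            # genuinely full: grow
```

For example, a table of size 100 with 50 occupied and 45 deleted entries would be rehashed at the same size (effective load 0.95, true load 0.5), while 70 occupied and 20 deleted would trigger growth.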

-- 
                              Waldek Hebisch
[email protected] 
