Nicolas,

> Where is the balance between expensive hash and expensive collisions ?
>
> The cost of a collision is:
> - perform hash and = on each successive key until the hash is different
> or = is true
>
> Omitting a single byte in a ByteString leads to 256 possible collisions...
>
> So with this POV, it is better to spend time in hash to reduce the
> cost of worst-case distribution of keys with default hash algorithm.
>   

In general, yes.
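Nicolas's point about collision cost can be made concrete.  Here is a Python sketch (not Pharo code; the class name and counter are invented for illustration): when a string hash covers only a short prefix, every key sharing that prefix lands in the same bucket, and each lookup degrades into a linear chain of = comparisons.

```python
class PrefixHashed:
    """Wraps a string but hashes only its first two characters,
    mimicking a hash restricted to leading characters."""
    eq_calls = 0  # counts how many times = (__eq__) is evaluated

    def __init__(self, s):
        self.s = s

    def __hash__(self):
        return hash(self.s[:2])  # deliberately weak: prefix only

    def __eq__(self, other):
        PrefixHashed.eq_calls += 1
        return self.s == other.s

# 256 keys that differ only past the hashed prefix: all collide.
keys = ["ab" + chr(i) for i in range(256)]
table = {PrefixHashed(k): k for k in keys}

PrefixHashed.eq_calls = 0
for k in keys:
    table[PrefixHashed(k)]  # each lookup walks the collision chain
print(PrefixHashed.eq_calls, "= comparisons for 256 lookups")
```

With a hash over the whole string, each lookup would settle after roughly one = comparison instead.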

> My POV was to reduce the cost for some distributions of keys, without
> caring too much about the worst case... One then has to provide one's
> own hash function to optimize for one's problem... But maybe that was
> not a good tradeoff...
>
> In any case, restricting hash to leading characters is the worst
> choice because it leads to late detection of string1 ~= string2 for
> strings of equal hash...
>   

Yes.  Filenames and URLs are particularly problematic.  Fortunately, 
they tend to be O(100) characters long rather than O(1000).  Thus, the 
less demanding path of simply hashing all the characters is efficient 
enough for the cases I've seen so far.
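To see why hashing all the characters pays off for such strings, consider this Python sketch (assumptions: FNV-1a as a stand-in full-content hash — Pharo's actual ByteString hash is different — and the URLs are made up).  URLs share long common prefixes, so a prefix-only hash collapses them all onto one value, while a full-content hash separates them at a cost linear in their ~100-character length.

```python
def fnv1a(data: bytes) -> int:
    """FNV-1a over every byte: a simple stand-in for a full-content hash."""
    h = 0xcbf29ce484222325
    for b in data:
        h = ((h ^ b) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

# Hypothetical URLs: long shared prefix, difference only at the tail.
urls = [f"http://example.org/mail/archive/msg{i:04d}.html" for i in range(1000)]

prefix_hashes = {fnv1a(u[:16].encode()) for u in urls}  # leading chars only
full_hashes = {fnv1a(u.encode()) for u in urls}         # every character

print(len(prefix_hashes), "distinct prefix hashes")  # all collide into one
print(len(full_hashes), "distinct full hashes")      # all separated
```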

> IMO, we should not let Array hash ~= Interval hash coexist with Array = 
> Interval, because it's like laying traps in the programmer's path.
>   

Yes, I agree.  I'd vote for breaking #=, with the caveat that some 
existing code will likely break.  Does Pharo have a release strategy 
that allows more fundamental changes to go in, as opposed to 
maintenance/improvement fixes?
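The trap can be shown in collection terms with a Python sketch of the Array/Interval situation (the class names here are invented stand-ins, not Pharo's classes): two objects that answer true to = but compute their hashes by different recipes, so a hashed collection silently fails to find one via the other.

```python
class ArrayLike:
    """Stores its elements and hashes them directly."""
    def __init__(self, items):
        self.items = tuple(items)

    def __eq__(self, other):
        return self.items == tuple(other.items)

    def __hash__(self):
        return hash(self.items)


class IntervalLike:
    """Stores only start/stop and hashes those, not the elements: the trap."""
    def __init__(self, start, stop):
        self.start, self.stop = start, stop

    @property
    def items(self):
        return tuple(range(self.start, self.stop + 1))

    def __eq__(self, other):
        return self.items == tuple(other.items)

    def __hash__(self):
        return hash((self.start, self.stop))  # different recipe than ArrayLike


a = ArrayLike([1, 2, 3])
i = IntervalLike(1, 3)
print(a == i)    # they are "equal"...
print(i in {a})  # ...yet the set never even compares them: hashes differ
```

This is exactly the invariant at stake: equal objects must produce equal hashes, or hashed collections quietly misbehave.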

Andres.

_______________________________________________
Pharo-project mailing list
[email protected]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project