2009/10/24 Andres Valloud <[email protected]>:
> Nicolas,
>> Yes, large Array hash has severe performance penalty anyway.
>> Maybe we shall not hash each and every element.
>>
>
> I disagree.  The issue is that if hashing a very large array (or string)
> becomes a performance problem, then we should engineer applications so
> that we hash the metadata as opposed to the data.
>
> In other words, assume you're hashing arbitrary 1kb strings.  You can't
> have more than ~1 million of those in memory at any given time, and
> there are only so many you can store in a database.  Chances are those
> 1kb strings represent chunks of data that can be uniquely identified
> with no more than e.g.: 64 bits.  Then, just hash those 64 bits and be
> done with it.  In the absence of assumptions on the hashed data, though,
> you have no option other than to hash everything.
>

Good point.
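The metadata approach Andres describes could be sketched like this (in Python for illustration; the class and field names are made up, not from any Squeak/Pharo library):

```python
import hashlib

# Sketch: derive a short identifier from each 1 kB payload once, then
# hash the cached identifier instead of rehashing the full data.
class Chunk:
    def __init__(self, data: bytes):
        self.data = data
        # 64-bit key computed once (assumption: the digest is unique
        # enough for the key space discussed above).
        self.key = int.from_bytes(
            hashlib.blake2b(data, digest_size=8).digest(), 'big')

    def __eq__(self, other):
        return isinstance(other, Chunk) and self.data == other.data

    def __hash__(self):
        # O(1) per lookup: hash the cached key, not the 1 kB payload.
        return self.key
```

Equality still compares the full data, so correctness is preserved; only the hash is shortcut through the precomputed key.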

Where is the balance between an expensive hash and expensive collisions?

The cost of a collision is:
- perform hash and = on each successive key until the hash differs or
= answers true.

Omitting a single byte when hashing a ByteString allows up to 256
distinct keys to collide...

So from this point of view, it is better to spend time in hash to
reduce the cost of a worst-case distribution of keys under the default
hash algorithm. My point of view was to reduce the cost for some
distributions of keys, without caring too much about the worst case...
One would then have to provide one's own hash function to optimize for
a particular problem... But maybe that was not a good tradeoff...
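The tradeoff can be made concrete by counting how long the worst bucket scan gets when the hash function skips part of each key (a rough Python sketch; the helper names are hypothetical, not Squeak code):

```python
from collections import defaultdict

# Group keys by hash value and report the fullest bucket: a failed
# lookup there performs = against every stored key before giving up.
def worst_bucket(keys, hash_fn):
    buckets = defaultdict(list)
    for k in keys:
        buckets[hash_fn(k)].append(k)
    return max(len(b) for b in buckets.values())

keys = ['record-%04d' % i for i in range(256)]
full = worst_bucket(keys, hash)                      # hash every character
partial = worst_bucket(keys, lambda s: hash(s[:7]))  # ignore the tail
# partial is 256: all keys collapse into one bucket, so a lookup may
# pay 256 = comparisons; full stays near 1.
```

Here the per-key saving of a cheaper hash is dwarfed by the linear bucket scan it causes.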

In any case, restricting the hash to the leading characters is the
worst choice, because it leads to late detection of string1 ~= string2
for strings with an equal hash...
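In Python terms (the paths below are invented for illustration): keys that share a long common prefix, such as file paths or URLs, are guaranteed to collide under a leading-characters hash, and the = comparison that follows must also walk that shared prefix before it finds the first difference.

```python
# Hash only the first n characters (the sampling strategy criticized
# above; not a real Squeak method).
def leading_hash(s, n=8):
    return hash(s[:n])

a = '/usr/local/lib/app/module_a.st'
b = '/usr/local/lib/app/module_b.st'
collide = leading_hash(a) == leading_hash(b)  # same 8-char prefix
# Sampling trailing characters instead separates these keys at once:
separate = hash(a[-8:]) != hash(b[-8:])
```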

>> And, even forgetting about equality with Array, Interval>>#= and
>> Interval>>#hash do not agree in Squeak and Pharo:
>>
>>       (1 to: 10 by: 5)  = (1 to: 9 by: 5). -> true
>>       (1 to: 10 by: 5) hash = (1 to: 9 by: 5) hash. -> false
>>
>>
>
> Sometimes the end points of intervals are important... maybe it's better
> to have equality fail than "fixing" hash in this case.
>
> Andres.
>

Changing either Interval>>#= or Interval>>#hash is NECESSARY.
I note that Interval is used for Text selection, and in that context
(3 to: 2) is different from (4 to: 3), though both equal #().
But this usage is related to another oddity (an empty Interval has a
first and a last element!).

IMO, we should not let Array hash ~= Interval hash coexist with
Array = Interval, because that lays traps in the programmer's path.
Sooner or later an application will exhibit a non-repeatable error,
just because the allocated size of a Set was different and caused a
collision.
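A minimal Python analogue of the trap (a hypothetical class, not the actual Squeak Interval): = compares the generated elements while hash looks at the raw bounds, so two = intervals can land in different buckets.

```python
class Interval:
    def __init__(self, start, stop, step=1):
        self.start, self.stop, self.step = start, stop, step

    def elements(self):
        return tuple(range(self.start, self.stop + 1, self.step))

    def __eq__(self, other):
        # = compares the generated elements, like the Squeak behavior.
        return isinstance(other, Interval) and \
            self.elements() == other.elements()

    def __hash__(self):
        # Deliberately wrong, mirroring the reported bug:
        # (1 to: 10 by: 5) and (1 to: 9 by: 5) are = but hash
        # differently because only the bounds are hashed.
        return hash((self.start, self.stop, self.step))

a = Interval(1, 10, 5)   # elements (1, 6)
b = Interval(1, 9, 5)    # elements (1, 6)
# a == b, yet {a, b} answers a size of 2: exactly the kind of
# non-repeatable Set behavior described above.
```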

A general-purpose library has to be robust, reliable, and provide
repeatable behavior.
If behavior changes merely because of the allocated size of a
container, then it is a bad library.

I am OK with the solution of making Array ~= Interval, if we have to
trade that feature for HashedCollection efficiency.
Of course, this could impact some existing code...

Nicolas

> _______________________________________________
> Pharo-project mailing list
> [email protected]
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>
