Re: A12: Strings

Larry Wall Wed, 21 Apr 2004 10:21:20 -0700

On Wed, Apr 21, 2004 at 11:04:02AM +0100, Tim Bunce wrote:
: > Hashes should handle various types of built-in key strings properly
: > by default.
: 
: What is "properly" for string?


The way it oughta, whatever that is...  I was aiming to set policy
rather than implementation there.  :-)

: Is it to hash the "sequence of integers"
: as if they're 32 bits wide even if they're less?  Is that sufficient?

That would be one way.  The point being that the hash mustn't tell
you that two strings are different when they would compare the same,
even if they are in different internal representations to begin with.
It's okay if the hash occassionally says two strings are the same when
in fact they'd compare differently.  The actual weakness is likely to
be in the definition of comparison rather than the definition of the
hash function, especially if we let people specify the standards of
comparison for the hash keys.  That says that the hash function has
to either be weaker than the weakest specifiable comparison, or we have
to be able to "weaken" the hash such that it doesn't lie about what
might match.  That sounds like research...

Well, it's probably not that bad.  Much like with other sorting
problems, all you have to do is keep track of a canonicalized key
in addition to the "real" key.  The hash is always calculated off of
the canonicalized key rather than the actual key.  (Whether you choose
to store or recreate the canonical key is one of those space/time
tradeoffs that "use less" was originally intended to solve...)

If Unicode makes your brain hurt, just think of it in terms of
case sensitivity.  We could have a hash that was case insensitive
by always calculating the hash on a lower-cased key, and by doing
comparisons between lower-cased keys (notionally, at least).

So in Unicode terms, there are probably some speed benefits if you
know your keys are already canonicalized to the form required by
the comparison and hash functions.  That implies that the state
of canonicalization must be strongly typed (presumably dynamically
in Perl).  Canonicalization is one of those things you really don't
want to do redundantly.

Larry

Re: A12: Strings

Reply via email to