To answer your original question though, the string's hash function *has* to
be based on the string value, not the object identity.  (So interning or
lack thereof will not affect you.)  This is because the String class
overrides Equals.  Any class that overrides Equals is *required* to override
GetHashCode too and to make the two consistent.  (I.e. if Equals says two
objects are the same, they must have the same hash values.)

However, just because two strings have the same hash code doesn't mean they
are the same - GetHashCode *is* allowed to return false positives.  (It's
just false negatives that aren't allowed.)  In fact here is a guaranteed
valid-for-every-occasion implementation of GetHashCode():

  public override int GetHashCode() { return 1; }

This doesn't break any rules.  Of course it won't yield great performance if
you use this on objects which are acting as keys in a HashTable...  And
it'll get you rather a lot of false positives for comparison.  :-)


But *any* hash has some chance of collision, unless the hash actually has at
least as many bits as the data being hashed.  The larger the hash, the lower
the probability of collision will be (given a suitably good hash algorithm
of course...) but the probability never drops to zero.  MD5 is believed to
be a good hash algorithm, and it has a length of 128 bytes (i.e. 1024 bits)
IIRC, which gives a pretty low chance of a collision.

But the question I would ask is are you actually wasting more space by
storing MD5 hashes than you would be by just remembering the strings in the
first place?  Unless your strings are on average longer than 128 bytes (the
size of an MD5 hash) then it will actually be rather more expensive to do
what you're doing than just to store all the strings...  Or is there some
reason you can't store the strings?


--
Ian Griffiths
DevelopMentor

----- Original Message -----
From: "Erick Thompson" <[EMAIL PROTECTED]>


> Upon further reflection, I think that using GetHashCode isn't a good idea
to
> check if I've seen a string before. I think I'll use a MD5 sig instead, so
I
> don't have to worry about collisions.
>
> Erick
>
> ----- Original Message -----
> From: "Erick Thompson" <[EMAIL PROTECTED]>
>
> > I need to check if a particular string has been seen before, so I
> > was going to maintain a list of hashcodes from the strings.
> > However, I want to confirm that the GetHashCode method
> > returns a hashcode based on the string contents, not on the
> > string object. The reason I'm not completely sure is that I'm
> > not totally confident I know how string interning will effect things.
> > Can I use GetHashCode to identify a string?

You can read messages from the DOTNET archive, unsubscribe from DOTNET, or
subscribe to other DevelopMentor lists at http://discuss.develop.com.

Reply via email to