Re: Hashing algorythm for text strings without embedded blanks

Joel C. Ewing Tue, 27 Jul 2010 18:13:20 -0700

Explanation of hashing functions and table look up:
  If you have a table with entries identified by a key, one of the
typical requirements is to locate a table entry having some given key
(or prove the entry doesn't exist).
  If the range of key values is suitably small so that all key values
may be uniquely mapped to a distinct numeric value with a modest range,
then the simplest solution may be an array using that unique value as a
direct index into the array.  But, in many cases the range of possible
key values makes that totally impossible --  e.g.,  an 8-byte,
unconstrained key could assume 2**64 possible values which would be
impractical for a directly-indexed, in-memory table.
  One solution in such cases is to instead use a hashing function to
transform the key domain to a relatively small, non-unique numeric range
that is only slightly larger (say 125%) than the maximum number of
distinct table entries to be stored, and to use a "hash table" sized to
match the hash function range.  The numeric hash value rather than the
actual key is used as a direct index into the table array.  Since the
hash function range is smaller than the key domain, multiple keys can
map to a single hash value, so the initial table entry located might be
for a different key.  To make look up work, each table entry must also
contain the actual key value for the entry, and there must be a rule (or
perhaps a pointer) that tells where to look next in the table for cases
where there is a "collision" when the hash function points to a used
location in the table for a different key.  A number of different
schemes are used for handling collisions, but basically you try other
table entry locations until you either find an empty location or one
with the right key.
  The beauty of this scheme is that with a suitable hash function,
suitably distributed key values, suitable collision handling rules, and
a table that is guaranteed to never become over 80% full, it can be
proved that the average number of key comparisons to locate an entry
will never exceed around 2.5. It doesn't matter how big the table is as
long as the 80% rule is not violated.  A binary search requires an
average of 2.4 key comparisons to locate all entries in a  7-entry
table,  3.2 comparisons for a 15-entry table, and the average increases
as the number of entries increases.  Even with a less than optimal
hashing function and a less than optimal distribution of key values it
is easy to see how hash table lookup can run circles around binary
search for repeated random lookups as the total number of entries
becomes large.
  The disadvantage is that if at the end of processing you want to list
all table entries in any semblance of order, a sorting process will be
required; and depending on implementation technique, you may even have
to hunt through a sparsely filled table just to find all the entries.
   Joel C Ewing


On 07/27/2010 02:43 PM, zMan wrote:
> OK, I have a splitting headache and may be dense anyway: how is it easier to
> look it up once it's hashed? What am I missing?
> 
> On Tue, Jul 27, 2010 at 3:36 PM, Binyamin Dissen <bdis...@dissensoftware.com
>> wrote:
> 
>> On Tue, 27 Jul 2010 15:32:24 -0400 zMan <zedgarhoo...@gmail.com> wrote:
>>
>> :>What's the goal of the hashing -- obfuscation? Shortening the data?
>>
>> Lookup.
>>
>> Does the entry exist, and if so what are its attributes. If does not exist,
>> add.
>>
>> :>On Tue, Jul 27, 2010 at 3:18 PM, Binyamin Dissen <
>> bdis...@dissensoftware.com
>> :>> wrote:
>>
>> :>> Is there a preferred hashing algorithm for such strings? The strings
>> will
>> :>> be
>> :>> fixed length with trailing blanks.
>>
>> :>> I was thinking of simply doing a MOD64 (or higher) of the sum of the
>> :>> non-blank
>> :>> part of the string by words, but as this is text with a limited
>> character
>> :>> set,
>> :>> would this lead to excessive alias chains? Has research been done in
>> this
>> :>> area? There will mostly be search activity but also add activity.
>>
>> --
>> Binyamin Dissen <bdis...@dissensoftware.com>
>> http://www.dissensoftware.com
>>
>> Director, Dissen Software, Bar & Grill - Israel
>>
>>
>> Should you use the mailblocks package and expect a response from me,
>> you should preauthorize the dissensoftware.com domain.
>>
>> I very rarely bother responding to challenge/response systems,
>> especially those from irresponsible companies.
...

-- 
Joel C. Ewing, Fort Smith, AR        jcew...@acm.org

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html

Re: Hashing algorythm for text strings without embedded blanks

Reply via email to