Re: Hashing algorythm for text strings without embedded blanks
Explanation of hashing functions and table look up: If you have a table with entries identified by a key, one of the typical requirements is to locate a table entry having some given key (or prove the entry doesn't exist). If the range of key values is suitably small so that all key values may be uniquely mapped to a distinct numeric value with a modest range, then the simplest solution may be an array using that unique value as a direct index into the array. But, in many cases the range of possible key values makes that totally impossible -- e.g., an 8-byte, unconstrained key could assume 2**64 possible values which would be impractical for a directly-indexed, in-memory table. One solution in such cases is to instead use a hashing function to transform the key domain to a relatively small, non-unique numeric range that is only slightly larger (say 125%) than the maximum number of distinct table entries to be stored, and to use a "hash table" sized to match the hash function range. The numeric hash value rather than the actual key is used as a direct index into the table array. Since the hash function range is smaller than the key domain, multiple keys can map to a single hash value, so the initial table entry located might be for a different key. To make look up work, each table entry must also contain the actual key value for the entry, and there must be a rule (or perhaps a pointer) that tells where to look next in the table for cases where there is a "collision" when the hash function points to a used location in the table for a different key. A number of different schemes are used for handling collisions, but basically you try other table entry locations until you either find an empty location or one with the right key. The beauty of this scheme is that with a suitable hash function, suitably distributed key values, suitable collision handling rules, and a table that is guaranteed to never become over 80% full, it can be proved that the average number of key comparisons to locate an entry will never exceed around 2.5. It doesn't matter how big the table is as long as the 80% rule is not violated. A binary search requires an average of 2.4 key comparisons to locate all entries in a 7-entry table, 3.2 comparisons for a 15-entry table, and the average increases as the number of entries increases. Even with a less than optimal hashing function and a less than optimal distribution of key values it is easy to see how hash table lookup can run circles around binary search for repeated random lookups as the total number of entries becomes large. The disadvantage is that if at the end of processing you want to list all table entries in any semblance of order, a sorting process will be required; and depending on implementation technique, you may even have to hunt through a sparsely filled table just to find all the entries. Joel C Ewing On 07/27/2010 02:43 PM, zMan wrote: > OK, I have a splitting headache and may be dense anyway: how is it easier to > look it up once it's hashed? What am I missing? > > On Tue, Jul 27, 2010 at 3:36 PM, Binyamin Dissen > wrote: > >> On Tue, 27 Jul 2010 15:32:24 -0400 zMan wrote: >> >> :>What's the goal of the hashing -- obfuscation? Shortening the data? >> >> Lookup. >> >> Does the entry exist, and if so what are its attributes. If does not exist, >> add. >> >> :>On Tue, Jul 27, 2010 at 3:18 PM, Binyamin Dissen < >> bdis...@dissensoftware.com >> :>> wrote: >> >> :>> Is there a preferred hashing algorithm for such strings? The strings >> will >> :>> be >> :>> fixed length with trailing blanks. >> >> :>> I was thinking of simply doing a MOD64 (or higher) of the sum of the >> :>> non-blank >> :>> part of the string by words, but as this is text with a limited >> character >> :>> set, >> :>> would this lead to excessive alias chains? Has research been done in >> this >> :>> area? There will mostly be search activity but also add activity. >> >> -- >> Binyamin Dissen >> http://www.dissensoftware.com >> >> Director, Dissen Software, Bar & Grill - Israel >> >> >> Should you use the mailblocks package and expect a response from me, >> you should preauthorize the dissensoftware.com domain. >> >> I very rarely bother responding to challenge/response systems, >> especially those from irresponsible companies. ... -- Joel C. Ewing, Fort Smith, ARjcew...@acm.org -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO Search the archives at http://bama.ua.edu/archives/ibm-main.html
Re: Hashing algorythm for text strings without embedded blanks
As I wrote, there is the issue of dynamically balancing the tree as random insertions come in. On Tue, 27 Jul 2010 15:43:08 -0500 Hal Merritt wrote: :>Am I missing something? Why not use a binary chop search? That is, each compare cuts the universe in half. For a million entries, you would need 20 or less compares. :> :>http://en.wikipedia.org/wiki/Binary_search_algorithm :> :> :>-Original Message- :>From: IBM Mainframe Discussion List [mailto:ibm-m...@bama.ua.edu] On Behalf Of Binyamin Dissen :>Sent: Tuesday, July 27, 2010 3:14 PM :>To: IBM-MAIN@bama.ua.edu :>Subject: Re: Hashing algorythm for text strings without embedded blanks :> :>On Tue, 27 Jul 2010 15:43:47 -0400 zMan wrote: :> :>:>OK, I have a splitting headache and may be dense anyway: how is it easier to :>:>look it up once it's hashed? What am I missing? :> :>Not easier, faster. :> :>Assume 10 strings. A serial search will take on average 5 compares. A :>mod64 hash should reduce it on average to under 800 compares, hash to slot and :>run the alias chain. :> :>A b-tree would be better but then there is the issue of balancing it while :>others are busy transversing it. :> :>:>On Tue, Jul 27, 2010 at 3:36 PM, Binyamin Dissen :>> wrote: :> :>:>> On Tue, 27 Jul 2010 15:32:24 -0400 zMan wrote: :> :>:>> :>What's the goal of the hashing -- obfuscation? Shortening the data? :> :>:>> Lookup. :> :>:>> Does the entry exist, and if so what are its attributes. If does not exist, :>:>> add. :> :>:>> :>On Tue, Jul 27, 2010 at 3:18 PM, Binyamin Dissen < :>:>> bdis...@dissensoftware.com :>:>> :>> wrote: :> :>:>> :>> Is there a preferred hashing algorithm for such strings? The strings :>:>> will :>:>> :>> be :>:>> :>> fixed length with trailing blanks. :> :>:>> :>> I was thinking of simply doing a MOD64 (or higher) of the sum of the :>:>> :>> non-blank :>:>> :>> part of the string by words, but as this is text with a limited :>:>> character :>:>> :>> set, :>:>> :>> would this lead to excessive alias chains? Has research been done in :>:>> this :>:>> :>> area? There will mostly be search activity but also add activity. -- Binyamin Dissen http://www.dissensoftware.com Director, Dissen Software, Bar & Grill - Israel Should you use the mailblocks package and expect a response from me, you should preauthorize the dissensoftware.com domain. I very rarely bother responding to challenge/response systems, especially those from irresponsible companies. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO Search the archives at http://bama.ua.edu/archives/ibm-main.html
Re: Hashing algorythm for text strings without embedded blanks
On Tue, 27 Jul 2010 15:43:08 -0500, Hal Merritt wrote: >Am I missing something? Why not use a binary chop search? That is, each >compare cuts the universe in half. For a million entries, you would need 20 >or less compares. > >http://en.wikipedia.org/wiki/Binary_search_algorithm > Insertion? For John G.'s suggestion of CKSUM followed by modulo small prime: Does primality matter? Shouldn't if the range of CKSUM is spectrally neutral. -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO Search the archives at http://bama.ua.edu/archives/ibm-main.html
Re: Hashing algorythm for text strings without embedded blanks
Am I missing something? Why not use a binary chop search? That is, each compare cuts the universe in half. For a million entries, you would need 20 or less compares. http://en.wikipedia.org/wiki/Binary_search_algorithm -Original Message- From: IBM Mainframe Discussion List [mailto:ibm-m...@bama.ua.edu] On Behalf Of Binyamin Dissen Sent: Tuesday, July 27, 2010 3:14 PM To: IBM-MAIN@bama.ua.edu Subject: Re: Hashing algorythm for text strings without embedded blanks On Tue, 27 Jul 2010 15:43:47 -0400 zMan wrote: :>OK, I have a splitting headache and may be dense anyway: how is it easier to :>look it up once it's hashed? What am I missing? Not easier, faster. Assume 10 strings. A serial search will take on average 5 compares. A mod64 hash should reduce it on average to under 800 compares, hash to slot and run the alias chain. A b-tree would be better but then there is the issue of balancing it while others are busy transversing it. :>On Tue, Jul 27, 2010 at 3:36 PM, Binyamin Dissen > wrote: :>> On Tue, 27 Jul 2010 15:32:24 -0400 zMan wrote: :>> :>What's the goal of the hashing -- obfuscation? Shortening the data? :>> Lookup. :>> Does the entry exist, and if so what are its attributes. If does not exist, :>> add. :>> :>On Tue, Jul 27, 2010 at 3:18 PM, Binyamin Dissen < :>> bdis...@dissensoftware.com :>> :>> wrote: :>> :>> Is there a preferred hashing algorithm for such strings? The strings :>> will :>> :>> be :>> :>> fixed length with trailing blanks. :>> :>> I was thinking of simply doing a MOD64 (or higher) of the sum of the :>> :>> non-blank :>> :>> part of the string by words, but as this is text with a limited :>> character :>> :>> set, :>> :>> would this lead to excessive alias chains? Has research been done in :>> this :>> :>> area? There will mostly be search activity but also add activity. -- Binyamin Dissen http://www.dissensoftware.com Director, Dissen Software, Bar & Grill - Israel Should you use the mailblocks package and expect a response from me, you should preauthorize the dissensoftware.com domain. I very rarely bother responding to challenge/response systems, especially those from irresponsible companies. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO Search the archives at http://bama.ua.edu/archives/ibm-main.html NOTICE: This electronic mail message and any files transmitted with it are intended exclusively for the individual or entity to which it is addressed. The message, together with any attachment, may contain confidential and/or privileged information. Any unauthorized review, use, printing, saving, copying, disclosure or distribution is strictly prohibited. If you have received this message in error, please immediately advise the sender by reply email and delete all copies. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO Search the archives at http://bama.ua.edu/archives/ibm-main.html
Re: Hashing algorythm for text strings without embedded blanks
On Tue, 27 Jul 2010 15:43:47 -0400 zMan wrote: :>OK, I have a splitting headache and may be dense anyway: how is it easier to :>look it up once it's hashed? What am I missing? Not easier, faster. Assume 10 strings. A serial search will take on average 5 compares. A mod64 hash should reduce it on average to under 800 compares, hash to slot and run the alias chain. A b-tree would be better but then there is the issue of balancing it while others are busy transversing it. :>On Tue, Jul 27, 2010 at 3:36 PM, Binyamin Dissen > wrote: :>> On Tue, 27 Jul 2010 15:32:24 -0400 zMan wrote: :>> :>What's the goal of the hashing -- obfuscation? Shortening the data? :>> Lookup. :>> Does the entry exist, and if so what are its attributes. If does not exist, :>> add. :>> :>On Tue, Jul 27, 2010 at 3:18 PM, Binyamin Dissen < :>> bdis...@dissensoftware.com :>> :>> wrote: :>> :>> Is there a preferred hashing algorithm for such strings? The strings :>> will :>> :>> be :>> :>> fixed length with trailing blanks. :>> :>> I was thinking of simply doing a MOD64 (or higher) of the sum of the :>> :>> non-blank :>> :>> part of the string by words, but as this is text with a limited :>> character :>> :>> set, :>> :>> would this lead to excessive alias chains? Has research been done in :>> this :>> :>> area? There will mostly be search activity but also add activity. -- Binyamin Dissen http://www.dissensoftware.com Director, Dissen Software, Bar & Grill - Israel Should you use the mailblocks package and expect a response from me, you should preauthorize the dissensoftware.com domain. I very rarely bother responding to challenge/response systems, especially those from irresponsible companies. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO Search the archives at http://bama.ua.edu/archives/ibm-main.html
Re: Hashing algorythm for text strings without embedded blanks
This is what I have used in past mainframe applications: http://www.vm.ibm.com/download/packages/descript.cgi?HASHWF /Tom Kern Binyamin Dissen wrote: > Is there a preferred hashing algorithm for such strings? The strings will be > fixed length with trailing blanks. > > I was thinking of simply doing a MOD64 (or higher) of the sum of the non-blank > part of the string by words, but as this is text with a limited character set, > would this lead to excessive alias chains? Has research been done in this > area? There will mostly be search activity but also add activity. > > -- > Binyamin Dissen > http://www.dissensoftware.com > > Director, Dissen Software, Bar & Grill - Israel > > > Should you use the mailblocks package and expect a response from me, > you should preauthorize the dissensoftware.com domain. > > I very rarely bother responding to challenge/response systems, > especially those from irresponsible companies. > > -- > For IBM-MAIN subscribe / signoff / archive access instructions, > send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO > Search the archives at http://bama.ua.edu/archives/ibm-main.html > -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO Search the archives at http://bama.ua.edu/archives/ibm-main.html
Re: Hashing algorythm for text strings without embedded blanks
I have used the technique of adding byte values for a hash count in two shops over two decades to cut the length of a threaded list. Both times I used mod256 to good effect. The first time the string was an eight character variable name, the second time a set of strings over 64 bytes long. Both could contain spaces. In the first case the set was in the low thousands and short/long ratio was less than 3/1. In the second case, worst case set was above 1 million and the short long ratio was less than 1.5/1. In this case, the performance hit is low enough (and probably the systems fast enough) that it does not pay to go mod512. A million pages is still a technology limit for the kind of reports being indexed. One thing that helped in both cases was allocating control blocks pointed to by the hash tables in sets of pages to improve locality of reference. IBM Mainframe Discussion List wrote on 07/27/2010 03:18:49 PM: > From: Binyamin Dissen > Subject: Hashing algorythm for text strings without embedded blanks > > Is there a preferred hashing algorithm for such strings? The strings will be > fixed length with trailing blanks. > > I was thinking of simply doing a MOD64 (or higher) of the sum of thenon-blank > part of the string by words, but as this is text with a limited character set, > would this lead to excessive alias chains? Has research been done in this > area? There will mostly be search activity but also add activity. - The information contained in this communication (including any attachments hereto) is confidential and is intended solely for the personal and confidential use of the individual or entity to whom it is addressed. If the reader of this message is not the intended recipient or an agent responsible for delivering it to the intended recipient, you are hereby notified that you have received this communication in error and that any review, dissemination, copying, or unauthorized use of this information, or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this communication in error, please notify us immediately by e-mail, and delete the original message. Thank you -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO Search the archives at http://bama.ua.edu/archives/ibm-main.html
Re: Hashing algorythm for text strings without embedded blanks
OK, I have a splitting headache and may be dense anyway: how is it easier to look it up once it's hashed? What am I missing? On Tue, Jul 27, 2010 at 3:36 PM, Binyamin Dissen wrote: > On Tue, 27 Jul 2010 15:32:24 -0400 zMan wrote: > > :>What's the goal of the hashing -- obfuscation? Shortening the data? > > Lookup. > > Does the entry exist, and if so what are its attributes. If does not exist, > add. > > :>On Tue, Jul 27, 2010 at 3:18 PM, Binyamin Dissen < > bdis...@dissensoftware.com > :>> wrote: > > :>> Is there a preferred hashing algorithm for such strings? The strings > will > :>> be > :>> fixed length with trailing blanks. > > :>> I was thinking of simply doing a MOD64 (or higher) of the sum of the > :>> non-blank > :>> part of the string by words, but as this is text with a limited > character > :>> set, > :>> would this lead to excessive alias chains? Has research been done in > this > :>> area? There will mostly be search activity but also add activity. > > -- > Binyamin Dissen > http://www.dissensoftware.com > > Director, Dissen Software, Bar & Grill - Israel > > > Should you use the mailblocks package and expect a response from me, > you should preauthorize the dissensoftware.com domain. > > I very rarely bother responding to challenge/response systems, > especially those from irresponsible companies. > > -- > For IBM-MAIN subscribe / signoff / archive access instructions, > send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO > Search the archives at http://bama.ua.edu/archives/ibm-main.html > -- zMan -- "I've got a mainframe and I'm not afraid to use it" -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO Search the archives at http://bama.ua.edu/archives/ibm-main.html
Re: Hashing algorythm for text strings without embedded blanks
On Tue, 27 Jul 2010 15:32:24 -0400 zMan wrote: :>What's the goal of the hashing -- obfuscation? Shortening the data? Lookup. Does the entry exist, and if so what are its attributes. If does not exist, add. :>On Tue, Jul 27, 2010 at 3:18 PM, Binyamin Dissen > wrote: :>> Is there a preferred hashing algorithm for such strings? The strings will :>> be :>> fixed length with trailing blanks. :>> I was thinking of simply doing a MOD64 (or higher) of the sum of the :>> non-blank :>> part of the string by words, but as this is text with a limited character :>> set, :>> would this lead to excessive alias chains? Has research been done in this :>> area? There will mostly be search activity but also add activity. -- Binyamin Dissen http://www.dissensoftware.com Director, Dissen Software, Bar & Grill - Israel Should you use the mailblocks package and expect a response from me, you should preauthorize the dissensoftware.com domain. I very rarely bother responding to challenge/response systems, especially those from irresponsible companies. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO Search the archives at http://bama.ua.edu/archives/ibm-main.html
Re: Hashing algorythm for text strings without embedded blanks
What's the goal of the hashing -- obfuscation? Shortening the data? On Tue, Jul 27, 2010 at 3:18 PM, Binyamin Dissen wrote: > Is there a preferred hashing algorithm for such strings? The strings will > be > fixed length with trailing blanks. > > I was thinking of simply doing a MOD64 (or higher) of the sum of the > non-blank > part of the string by words, but as this is text with a limited character > set, > would this lead to excessive alias chains? Has research been done in this > area? There will mostly be search activity but also add activity. > > -- > Binyamin Dissen > http://www.dissensoftware.com > > Director, Dissen Software, Bar & Grill - Israel > > > Should you use the mailblocks package and expect a response from me, > you should preauthorize the dissensoftware.com domain. > > I very rarely bother responding to challenge/response systems, > especially those from irresponsible companies. > > -- > For IBM-MAIN subscribe / signoff / archive access instructions, > send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO > Search the archives at http://bama.ua.edu/archives/ibm-main.html > -- zMan -- "I've got a mainframe and I'm not afraid to use it" -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO Search the archives at http://bama.ua.edu/archives/ibm-main.html
Hashing algorythm for text strings without embedded blanks
Is there a preferred hashing algorithm for such strings? The strings will be fixed length with trailing blanks. I was thinking of simply doing a MOD64 (or higher) of the sum of the non-blank part of the string by words, but as this is text with a limited character set, would this lead to excessive alias chains? Has research been done in this area? There will mostly be search activity but also add activity. -- Binyamin Dissen http://www.dissensoftware.com Director, Dissen Software, Bar & Grill - Israel Should you use the mailblocks package and expect a response from me, you should preauthorize the dissensoftware.com domain. I very rarely bother responding to challenge/response systems, especially those from irresponsible companies. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO Search the archives at http://bama.ua.edu/archives/ibm-main.html