Hashing algorythm for text strings without embedded blanks

2010-07-27 Thread Binyamin Dissen
Is there a preferred hashing algorithm for such strings? The strings will be
fixed length with trailing blanks.

I was thinking of simply doing a MOD64 (or higher) of the sum of the non-blank
part of the string by words, but as this is text with a limited character set,
would this lead to excessive alias chains? Has research been done in this
area? There will mostly be search activity but also add activity.

--
Binyamin Dissen bdis...@dissensoftware.com
http://www.dissensoftware.com

Director, Dissen Software, Bar  Grill - Israel


Should you use the mailblocks package and expect a response from me,
you should preauthorize the dissensoftware.com domain.

I very rarely bother responding to challenge/response systems,
especially those from irresponsible companies.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html


Re: Hashing algorythm for text strings without embedded blanks

2010-07-27 Thread zMan
What's the goal of the hashing -- obfuscation? Shortening the data?

On Tue, Jul 27, 2010 at 3:18 PM, Binyamin Dissen bdis...@dissensoftware.com
 wrote:

 Is there a preferred hashing algorithm for such strings? The strings will
 be
 fixed length with trailing blanks.

 I was thinking of simply doing a MOD64 (or higher) of the sum of the
 non-blank
 part of the string by words, but as this is text with a limited character
 set,
 would this lead to excessive alias chains? Has research been done in this
 area? There will mostly be search activity but also add activity.

 --
 Binyamin Dissen bdis...@dissensoftware.com
 http://www.dissensoftware.com

 Director, Dissen Software, Bar  Grill - Israel


 Should you use the mailblocks package and expect a response from me,
 you should preauthorize the dissensoftware.com domain.

 I very rarely bother responding to challenge/response systems,
 especially those from irresponsible companies.

 --
 For IBM-MAIN subscribe / signoff / archive access instructions,
 send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
 Search the archives at http://bama.ua.edu/archives/ibm-main.html




-- 
zMan -- I've got a mainframe and I'm not afraid to use it

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html


Re: Hashing algorythm for text strings without embedded blanks

2010-07-27 Thread Binyamin Dissen
On Tue, 27 Jul 2010 15:32:24 -0400 zMan zedgarhoo...@gmail.com wrote:

:What's the goal of the hashing -- obfuscation? Shortening the data?

Lookup. 

Does the entry exist, and if so what are its attributes. If does not exist,
add.

:On Tue, Jul 27, 2010 at 3:18 PM, Binyamin Dissen bdis...@dissensoftware.com
: wrote:

: Is there a preferred hashing algorithm for such strings? The strings will
: be
: fixed length with trailing blanks.

: I was thinking of simply doing a MOD64 (or higher) of the sum of the
: non-blank
: part of the string by words, but as this is text with a limited character
: set,
: would this lead to excessive alias chains? Has research been done in this
: area? There will mostly be search activity but also add activity.

--
Binyamin Dissen bdis...@dissensoftware.com
http://www.dissensoftware.com

Director, Dissen Software, Bar  Grill - Israel


Should you use the mailblocks package and expect a response from me,
you should preauthorize the dissensoftware.com domain.

I very rarely bother responding to challenge/response systems,
especially those from irresponsible companies.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html


Re: Hashing algorythm for text strings without embedded blanks

2010-07-27 Thread zMan
OK, I have a splitting headache and may be dense anyway: how is it easier to
look it up once it's hashed? What am I missing?

On Tue, Jul 27, 2010 at 3:36 PM, Binyamin Dissen bdis...@dissensoftware.com
 wrote:

 On Tue, 27 Jul 2010 15:32:24 -0400 zMan zedgarhoo...@gmail.com wrote:

 :What's the goal of the hashing -- obfuscation? Shortening the data?

 Lookup.

 Does the entry exist, and if so what are its attributes. If does not exist,
 add.

 :On Tue, Jul 27, 2010 at 3:18 PM, Binyamin Dissen 
 bdis...@dissensoftware.com
 : wrote:

 : Is there a preferred hashing algorithm for such strings? The strings
 will
 : be
 : fixed length with trailing blanks.

 : I was thinking of simply doing a MOD64 (or higher) of the sum of the
 : non-blank
 : part of the string by words, but as this is text with a limited
 character
 : set,
 : would this lead to excessive alias chains? Has research been done in
 this
 : area? There will mostly be search activity but also add activity.

 --
 Binyamin Dissen bdis...@dissensoftware.com
 http://www.dissensoftware.com

 Director, Dissen Software, Bar  Grill - Israel


 Should you use the mailblocks package and expect a response from me,
 you should preauthorize the dissensoftware.com domain.

 I very rarely bother responding to challenge/response systems,
 especially those from irresponsible companies.

 --
 For IBM-MAIN subscribe / signoff / archive access instructions,
 send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
 Search the archives at http://bama.ua.edu/archives/ibm-main.html




-- 
zMan -- I've got a mainframe and I'm not afraid to use it

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html


Re: Hashing algorythm for text strings without embedded blanks

2010-07-27 Thread Kirk Talman
I have used the technique of adding byte values for a hash count in two 
shops over two decades to cut the length of a threaded list.  Both times I 
used mod256 to good effect.

The first time the string was an eight character variable name, the second 
time a set of strings over 64 bytes long.  Both could contain spaces.

In the first case the set was in the low thousands and short/long ratio 
was less than 3/1.

In the second case, worst case set was above 1 million and the short long 
ratio was less than 1.5/1.  In this case, the performance hit is low 
enough (and probably the systems fast enough) that it does not pay to go 
mod512.  A million pages is still a technology limit for the kind of 
reports being indexed.

One thing that helped in both cases was allocating control blocks pointed 
to by the hash tables in sets of pages to improve locality of reference.

IBM Mainframe Discussion List IBM-MAIN@bama.ua.edu wrote on 07/27/2010 
03:18:49 PM:

 From: Binyamin Dissen bdis...@dissensoftware.com
 Subject: Hashing algorythm for text strings without embedded blanks
 
 Is there a preferred hashing algorithm for such strings? The strings 
will be
 fixed length with trailing blanks.
 
 I was thinking of simply doing a MOD64 (or higher) of the sum of 
thenon-blank
 part of the string by words, but as this is text with a limited 
character set,
 would this lead to excessive alias chains? Has research been done in 
this
 area? There will mostly be search activity but also add activity.


-
The information contained in this communication (including any
attachments hereto) is confidential and is intended solely for the
personal and confidential use of the individual or entity to whom
it is addressed. If the reader of this message is not the intended
recipient or an agent responsible for delivering it to the intended
recipient, you are hereby notified that you have received this
communication in error and that any review, dissemination, copying,
or unauthorized use of this information, or the taking of any
action in reliance on the contents of this information is strictly
prohibited. If you have received this communication in error,
please notify us immediately by e-mail, and delete the original
message. Thank you 

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html


Re: Hashing algorythm for text strings without embedded blanks

2010-07-27 Thread Thomas Kern
This is what I have used in past mainframe applications:

http://www.vm.ibm.com/download/packages/descript.cgi?HASHWF

/Tom Kern

Binyamin Dissen wrote:
 Is there a preferred hashing algorithm for such strings? The strings will be
 fixed length with trailing blanks.
 
 I was thinking of simply doing a MOD64 (or higher) of the sum of the non-blank
 part of the string by words, but as this is text with a limited character set,
 would this lead to excessive alias chains? Has research been done in this
 area? There will mostly be search activity but also add activity.
 
 --
 Binyamin Dissen bdis...@dissensoftware.com
 http://www.dissensoftware.com
 
 Director, Dissen Software, Bar  Grill - Israel
 
 
 Should you use the mailblocks package and expect a response from me,
 you should preauthorize the dissensoftware.com domain.
 
 I very rarely bother responding to challenge/response systems,
 especially those from irresponsible companies.
 
 --
 For IBM-MAIN subscribe / signoff / archive access instructions,
 send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
 Search the archives at http://bama.ua.edu/archives/ibm-main.html
 

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html


Re: Hashing algorythm for text strings without embedded blanks

2010-07-27 Thread Binyamin Dissen
On Tue, 27 Jul 2010 15:43:47 -0400 zMan zedgarhoo...@gmail.com wrote:

:OK, I have a splitting headache and may be dense anyway: how is it easier to
:look it up once it's hashed? What am I missing?

Not easier, faster.

Assume 10 strings. A serial search will take on average 5 compares. A
mod64 hash should reduce it on average to under 800 compares, hash to slot and
run the alias chain.

A b-tree would be better but then there is the issue of balancing it while
others are busy transversing it.

:On Tue, Jul 27, 2010 at 3:36 PM, Binyamin Dissen bdis...@dissensoftware.com
: wrote:

: On Tue, 27 Jul 2010 15:32:24 -0400 zMan zedgarhoo...@gmail.com wrote:

: :What's the goal of the hashing -- obfuscation? Shortening the data?

: Lookup.

: Does the entry exist, and if so what are its attributes. If does not exist,
: add.

: :On Tue, Jul 27, 2010 at 3:18 PM, Binyamin Dissen 
: bdis...@dissensoftware.com
: : wrote:

: : Is there a preferred hashing algorithm for such strings? The strings
: will
: : be
: : fixed length with trailing blanks.

: : I was thinking of simply doing a MOD64 (or higher) of the sum of the
: : non-blank
: : part of the string by words, but as this is text with a limited
: character
: : set,
: : would this lead to excessive alias chains? Has research been done in
: this
: : area? There will mostly be search activity but also add activity.

--
Binyamin Dissen bdis...@dissensoftware.com
http://www.dissensoftware.com

Director, Dissen Software, Bar  Grill - Israel


Should you use the mailblocks package and expect a response from me,
you should preauthorize the dissensoftware.com domain.

I very rarely bother responding to challenge/response systems,
especially those from irresponsible companies.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html


Re: Hashing algorythm for text strings without embedded blanks

2010-07-27 Thread Hal Merritt
Am I missing something? Why not use a binary chop search? That is, each compare 
cuts the universe in half.  For a million entries, you would need 20 or less 
compares. 

http://en.wikipedia.org/wiki/Binary_search_algorithm


-Original Message-
From: IBM Mainframe Discussion List [mailto:ibm-m...@bama.ua.edu] On Behalf Of 
Binyamin Dissen
Sent: Tuesday, July 27, 2010 3:14 PM
To: IBM-MAIN@bama.ua.edu
Subject: Re: Hashing algorythm for text strings without embedded blanks

On Tue, 27 Jul 2010 15:43:47 -0400 zMan zedgarhoo...@gmail.com wrote:

:OK, I have a splitting headache and may be dense anyway: how is it easier to
:look it up once it's hashed? What am I missing?

Not easier, faster.

Assume 10 strings. A serial search will take on average 5 compares. A
mod64 hash should reduce it on average to under 800 compares, hash to slot and
run the alias chain.

A b-tree would be better but then there is the issue of balancing it while
others are busy transversing it.

:On Tue, Jul 27, 2010 at 3:36 PM, Binyamin Dissen bdis...@dissensoftware.com
: wrote:

: On Tue, 27 Jul 2010 15:32:24 -0400 zMan zedgarhoo...@gmail.com wrote:

: :What's the goal of the hashing -- obfuscation? Shortening the data?

: Lookup.

: Does the entry exist, and if so what are its attributes. If does not exist,
: add.

: :On Tue, Jul 27, 2010 at 3:18 PM, Binyamin Dissen 
: bdis...@dissensoftware.com
: : wrote:

: : Is there a preferred hashing algorithm for such strings? The strings
: will
: : be
: : fixed length with trailing blanks.

: : I was thinking of simply doing a MOD64 (or higher) of the sum of the
: : non-blank
: : part of the string by words, but as this is text with a limited
: character
: : set,
: : would this lead to excessive alias chains? Has research been done in
: this
: : area? There will mostly be search activity but also add activity.

--
Binyamin Dissen bdis...@dissensoftware.com
http://www.dissensoftware.com

Director, Dissen Software, Bar  Grill - Israel


Should you use the mailblocks package and expect a response from me,
you should preauthorize the dissensoftware.com domain.

I very rarely bother responding to challenge/response systems,
especially those from irresponsible companies.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html
NOTICE: This electronic mail message and any files transmitted with it are 
intended
exclusively for the individual or entity to which it is addressed. The message, 
together with any attachment, may contain confidential and/or privileged 
information.
Any unauthorized review, use, printing, saving, copying, disclosure or 
distribution 
is strictly prohibited. If you have received this message in error, please 
immediately advise the sender by reply email and delete all copies.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html


Re: Hashing algorythm for text strings without embedded blanks

2010-07-27 Thread Paul Gilmartin
On Tue, 27 Jul 2010 15:43:08 -0500, Hal Merritt wrote:

Am I missing something? Why not use a binary chop search? That is, each 
compare cuts the universe in half.  For a million entries, you would need 20 
or less compares.

http://en.wikipedia.org/wiki/Binary_search_algorithm

Insertion?

For John G.'s suggestion of CKSUM followed by modulo small
prime: Does primality matter?  Shouldn't  if the range of
CKSUM is spectrally neutral.

-- gil

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html


Re: Hashing algorythm for text strings without embedded blanks

2010-07-27 Thread Binyamin Dissen
As I wrote, there is the issue of dynamically balancing the tree as random
insertions come in.

On Tue, 27 Jul 2010 15:43:08 -0500 Hal Merritt hmerr...@jackhenry.com wrote:

:Am I missing something? Why not use a binary chop search? That is, each 
compare cuts the universe in half.  For a million entries, you would need 20 or 
less compares. 
:
:http://en.wikipedia.org/wiki/Binary_search_algorithm
:
:
:-Original Message-
:From: IBM Mainframe Discussion List [mailto:ibm-m...@bama.ua.edu] On Behalf 
Of Binyamin Dissen
:Sent: Tuesday, July 27, 2010 3:14 PM
:To: IBM-MAIN@bama.ua.edu
:Subject: Re: Hashing algorythm for text strings without embedded blanks
:
:On Tue, 27 Jul 2010 15:43:47 -0400 zMan zedgarhoo...@gmail.com wrote:
:
::OK, I have a splitting headache and may be dense anyway: how is it easier to
::look it up once it's hashed? What am I missing?
:
:Not easier, faster.
:
:Assume 10 strings. A serial search will take on average 5 compares. A
:mod64 hash should reduce it on average to under 800 compares, hash to slot and
:run the alias chain.
:
:A b-tree would be better but then there is the issue of balancing it while
:others are busy transversing it.
:
::On Tue, Jul 27, 2010 at 3:36 PM, Binyamin Dissen bdis...@dissensoftware.com
:: wrote:
:
:: On Tue, 27 Jul 2010 15:32:24 -0400 zMan zedgarhoo...@gmail.com wrote:
:
:: :What's the goal of the hashing -- obfuscation? Shortening the data?
:
:: Lookup.
:
:: Does the entry exist, and if so what are its attributes. If does not 
exist,
:: add.
:
:: :On Tue, Jul 27, 2010 at 3:18 PM, Binyamin Dissen 
:: bdis...@dissensoftware.com
:: : wrote:
:
:: : Is there a preferred hashing algorithm for such strings? The strings
:: will
:: : be
:: : fixed length with trailing blanks.
:
:: : I was thinking of simply doing a MOD64 (or higher) of the sum of the
:: : non-blank
:: : part of the string by words, but as this is text with a limited
:: character
:: : set,
:: : would this lead to excessive alias chains? Has research been done in
:: this
:: : area? There will mostly be search activity but also add activity.

--
Binyamin Dissen bdis...@dissensoftware.com
http://www.dissensoftware.com

Director, Dissen Software, Bar  Grill - Israel


Should you use the mailblocks package and expect a response from me,
you should preauthorize the dissensoftware.com domain.

I very rarely bother responding to challenge/response systems,
especially those from irresponsible companies.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html


Re: Hashing algorythm for text strings without embedded blanks

2010-07-27 Thread Joel C. Ewing
Explanation of hashing functions and table look up:
  If you have a table with entries identified by a key, one of the
typical requirements is to locate a table entry having some given key
(or prove the entry doesn't exist).
  If the range of key values is suitably small so that all key values
may be uniquely mapped to a distinct numeric value with a modest range,
then the simplest solution may be an array using that unique value as a
direct index into the array.  But, in many cases the range of possible
key values makes that totally impossible --  e.g.,  an 8-byte,
unconstrained key could assume 2**64 possible values which would be
impractical for a directly-indexed, in-memory table.
  One solution in such cases is to instead use a hashing function to
transform the key domain to a relatively small, non-unique numeric range
that is only slightly larger (say 125%) than the maximum number of
distinct table entries to be stored, and to use a hash table sized to
match the hash function range.  The numeric hash value rather than the
actual key is used as a direct index into the table array.  Since the
hash function range is smaller than the key domain, multiple keys can
map to a single hash value, so the initial table entry located might be
for a different key.  To make look up work, each table entry must also
contain the actual key value for the entry, and there must be a rule (or
perhaps a pointer) that tells where to look next in the table for cases
where there is a collision when the hash function points to a used
location in the table for a different key.  A number of different
schemes are used for handling collisions, but basically you try other
table entry locations until you either find an empty location or one
with the right key.
  The beauty of this scheme is that with a suitable hash function,
suitably distributed key values, suitable collision handling rules, and
a table that is guaranteed to never become over 80% full, it can be
proved that the average number of key comparisons to locate an entry
will never exceed around 2.5. It doesn't matter how big the table is as
long as the 80% rule is not violated.  A binary search requires an
average of 2.4 key comparisons to locate all entries in a  7-entry
table,  3.2 comparisons for a 15-entry table, and the average increases
as the number of entries increases.  Even with a less than optimal
hashing function and a less than optimal distribution of key values it
is easy to see how hash table lookup can run circles around binary
search for repeated random lookups as the total number of entries
becomes large.
  The disadvantage is that if at the end of processing you want to list
all table entries in any semblance of order, a sorting process will be
required; and depending on implementation technique, you may even have
to hunt through a sparsely filled table just to find all the entries.
   Joel C Ewing

On 07/27/2010 02:43 PM, zMan wrote:
 OK, I have a splitting headache and may be dense anyway: how is it easier to
 look it up once it's hashed? What am I missing?
 
 On Tue, Jul 27, 2010 at 3:36 PM, Binyamin Dissen bdis...@dissensoftware.com
 wrote:
 
 On Tue, 27 Jul 2010 15:32:24 -0400 zMan zedgarhoo...@gmail.com wrote:

 :What's the goal of the hashing -- obfuscation? Shortening the data?

 Lookup.

 Does the entry exist, and if so what are its attributes. If does not exist,
 add.

 :On Tue, Jul 27, 2010 at 3:18 PM, Binyamin Dissen 
 bdis...@dissensoftware.com
 : wrote:

 : Is there a preferred hashing algorithm for such strings? The strings
 will
 : be
 : fixed length with trailing blanks.

 : I was thinking of simply doing a MOD64 (or higher) of the sum of the
 : non-blank
 : part of the string by words, but as this is text with a limited
 character
 : set,
 : would this lead to excessive alias chains? Has research been done in
 this
 : area? There will mostly be search activity but also add activity.

 --
 Binyamin Dissen bdis...@dissensoftware.com
 http://www.dissensoftware.com

 Director, Dissen Software, Bar  Grill - Israel


 Should you use the mailblocks package and expect a response from me,
 you should preauthorize the dissensoftware.com domain.

 I very rarely bother responding to challenge/response systems,
 especially those from irresponsible companies.
...

-- 
Joel C. Ewing, Fort Smith, ARjcew...@acm.org

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html