I did a small test comparing your floating point hash function with a
similar integer hash function like those I typically use in data
compression programs to compute word-order contexts. The program parses
input text into words, converts them to lower case, and computes a hash. To
test uniformity, I did no collision detection. Instead, the program sets a
hash table entry (1 byte) to 1 at the hash index and, when finished, prints
the number of entries set. This will be less than the true vocabulary size
due to collisions; higher counts indicate a more uniform distribution.
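In miniature, the occupancy count works like this (a sketch of the counting
idea only, with my own names; it is separate from the attached program):

```cpp
#include <cassert>
#include <vector>

// Count distinct hash indexes seen: a lower bound on vocabulary size,
// since colliding words are counted once. Higher counts for the same
// input mean a more uniform hash.
int occupancy(const std::vector<unsigned>& hashes, unsigned N) {
  std::vector<char> v(N, 0);  // one byte per table entry
  int count = 0;
  for (unsigned h : hashes) {
    unsigned i = h % N;       // reduce hash to a table index
    if (!v[i]) {
      v[i] = 1;
      ++count;
    }
  }
  return count;
}
```

Two hashes that land on the same index are counted once, which is exactly
the undercount the test measures.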
I used a hash table size of N = 1 MiB (2^20). For the integer hash, I
multiply by a randomly chosen odd number (M = 314159265) and add the next
byte of the string converted to a number in the range 1..26 corresponding
to a..z (not case sensitive). For the floating point hash I used M = pi/4.
To compute the hash index, I take the low 20 bits for the integer hash. For
the floating point hash, I multiply by a large prime number and truncate to
an integer as suggested in your patent application. To stay in range, this
number cannot be larger than N*(1-M)/27 = 8334. The largest such prime is
8329. (It is possible to convert the letter to 0..25 and use N*(1-M)/26,
but this gives a worse distribution for both the integer and floating point
hash). As a control, I also read the lowercase words into a map<string,
int> (actually an RB-tree) to get an exact vocabulary count. I
tested on enwik8 (100 MB text) and enwik9 (1 GB text) from
http://mattmahoney.net/dc/text.html on a 2.0 GHz T3200 under 32 bit Vista,
compiled with MinGW g++ 4.7.0 -O3. Vocabulary counts and run times are as
follows:
enwik8
map: 288,920, 21.75 sec (true count)
int: 251,936, 2.37 sec
float: 201,504, 2.68 sec
enwik9
map: 1,171,106, 258.69 sec
int: 703,954, 26.72 sec
float: 389,345, 29.87 sec
The test results show that converting words to ordinals is indeed an order
of magnitude faster than working with strings. Of course this is without
collision detection. To detect collisions, you still have to compare the
original strings.
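For reference, collision detection on top of the integer hash might look
like this (a sketch with my own names, not part of the attached program;
open addressing is one choice among several):

```cpp
#include <cassert>
#include <string>
#include <vector>

const unsigned N = 1u << 20;    // table size, 2^20 as in the test
const unsigned M = 314159265u;  // same random odd multiplier

// Hash a lowercase word (letters assumed a..z) to a table index.
unsigned hash_word(const std::string& w) {
  unsigned h = 0;
  for (char c : w)
    h = h * M + (c - 'a' + 1);
  return h & (N - 1);
}

// Insert with collision detection: keep the original string in each
// bucket and linear-probe until the word or an empty slot is found.
// Returns true if the word was new.
bool insert(std::vector<std::string>& table, const std::string& w) {
  unsigned i = hash_word(w);
  while (!table[i].empty()) {
    if (table[i] == w) return false;  // already present, not a new word
    i = (i + 1) & (N - 1);            // collision: probe the next slot
  }
  table[i] = w;
  return true;
}
```

The string comparison in the probe loop is the cost the hash-only test
avoids.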
The poor distribution of the floating point hash could probably be fixed by
multiplying by a larger number (it doesn't need to be prime) and using only
the low bits. However, it is still a little slower. On modern processors,
floating point arithmetic is as fast as integer arithmetic, but there are
the extra steps of converting characters to floating point and converting
hash indexes back to integers.
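A sketch of that fix (my assumption of what it would look like, not code I
benchmarked): keep the same running hash, scale the final value by a large
constant that need not be prime, truncate to a 64 bit integer, and mask off
the low 20 bits.

```cpp
#include <cassert>
#include <cmath>

const unsigned N = 1u << 20;   // hash table size, 2^20
const double M = atan(1.0);    // pi/4, same multiplier as the test

// Floating point hash of a lowercase word (letters assumed a..z),
// indexed by the low bits after scaling by a large constant.
unsigned fp_index(const char* w) {
  double h = 0;
  for (; *w; ++w)
    h = h * M + (*w - 'a' + 1);
  // h is bounded by 26/(1-M) ~ 121, so h*1e14 fits easily in 64 bits.
  return (unsigned long long)(h * 1e14) & (N - 1);
}
```

Scaling by 1e14 before truncating pushes the fractional variation of h into
the bits that the mask keeps, which should spread the indexes more evenly
than the small prime 8329 does.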
The integer hash function uses arithmetic overflow. Unsigned (but not
signed) overflow has well defined behavior in C and C++. In fact, it is
routinely used in FIPS certified cryptographic code.
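For example (the constant result below assumes the usual 32 bit unsigned
int):

```cpp
#include <cassert>
#include <climits>

// Unsigned overflow in C and C++ is defined to wrap modulo 2^width, so
// a multiplicative hash simply keeps the low 32 bits of the product.
unsigned wrap_demo() {
  unsigned h = UINT_MAX;        // 0xFFFFFFFF
  return h * 314159265u + 1u;   // wraps around: no trap, no undefined behavior
}
```

The same expression on signed int would be undefined behavior, which is why
the hash uses unsigned arithmetic throughout.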
Compile the attached code with -DFP to use the floating point hash, or
-DMAP to use the map. The default is to use the integer hash. To run,
redirect input to a text file. It will print the number of hash table
entries (vocabulary size detected), run time, maximum hash index (should be
slightly less than 1 MB) and hash table size (1 MB).
Also, I didn't see the patent claims in your application. Did you forget to
attach them?
On Tue, Mar 19, 2013 at 2:20 PM, Steve Richfield
<[email protected]> wrote:
> Matt,
>
> On Mon, Mar 18, 2013 at 6:03 PM, Matt Mahoney <[email protected]> wrote:
>
>> On Mon, Mar 18, 2013 at 4:29 PM, Steve Richfield
>>
>
>
>> As to your hash function, I don't see why this should be any faster
>> than integer arithmetic.
>>
> ...
>
>> And no, you do not get overflow errors with integer arithmetic. The
>> result is just truncated, which for many hash functions is actually
>> what you want.
>>
>
> After some research into the above, I will more precisely state the
> situation with integer hashing and await any clarification from you:
>
> Intel hardware does NOT throw exceptions to integer overflows, but
> provides for optional outboard tests to check for these overflows. Other
> "big iron", like IBM mainframes, DO throw exceptions to integer overflows,
> because they handle money, which is MUCH more valuable than bits.
>
> Some languages like Visual Basic add checks for integer overflows by
> default, but this test can be optionally disabled for ALL operations. Some
> languages like Java check for NO integer overflows. Some languages like C
> allow overflows on unsigned integers, while their operation is undefined
> for signed integer overflows. People writing in assembly can do anything
> they want, but must then pay for the overhead to get to and from their
> assembly coded routines, e.g. via DLL linkage.
>
> In the bad old days of individually compiling routines and linking them
> together, it was a trivial matter to specially compile one routine, or
> write a particular routine in a different language. Now, you must do
> something special like putting a routine into a DLL to get special
> treatment for a particular routine.
>
> Integer wraparound is one of the primary sources for the myriad updates we
> all get from Microsoft to plug security holes. Add some assholes who try to
> crash things (I get these all the time being fed into DrEliza.com) and life
> can get pretty difficult.
>
> I wouldn't think of attempting to write and maintain AI code in a language
> that isn't HIGHLY checked, especially for all matters pertaining to
> subscripting. Unfortunately, having integer hashing code ANYWHERE in the
> program means that, regardless of the present-day platform chosen,
> subscript computations must go unchecked EVERYWHERE in the program.
>
> Java covers this apparent weakness by allowing integer computations to go
> unchecked, but then checking all subscripts prior to use, to make sure that
> the code isn't stepping on something besides the array being addressed.
> This avoids clobbering memory, but does NOT necessarily guarantee that
> there was no wraparound in the computations that arrived at a valid though
> possibly incorrect subscript.
>
> So, this leaves the following choices:
> 1. A developer chooses an Intel processor and a language that doesn't
> check for integer overflows, or chooses to turn integer overflow checking
> off in a language like Visual Basic where it can be disabled. In the
> process they leave themselves wide open for all of the text on the Internet
> hitting the most complex AI code ever written, wrapping an integer
> around SOMEWHERE in the code to cause problems, or
> 2. They go ahead and use a slower integer method that survives overflow
> checking, with an eye to later replacing it with faster code, once the
> program works well enough to safely disable such checking. Of course,
> floating point methods would eliminate this step, so why bother?
> 3. They use floating point methods that work in the presence of FULL
> error checking.
> 4. Maybe future compiler writers will provide a method to disable error
> checking for specific statements, which would provide the ability to use
> integer hashing without sacrificing any error checking.
>
> So, as I see it (and commented in my previous posting) you are technically
> correct, but integer hashing isn't worth its non-speed costs, like the
> reduced reliability of all of the OTHER code.
>
> I have written thousands of pages of ugly AI code, and I wouldn't dream of
> turning off ANY available error checking. I prefer using Visual Basic, only
> for its superb error checking that can be selectively disabled once the
> program has been fully debugged, as other languages are MUCH more powerful.
>
> Further, there is a special problem in debugging NLP that makes debugging
> subtle problems a MAJOR challenge to be avoided at all costs - "heidenbugs"
> - where the program is working correctly, e.g. correctly picking up on
> something "between the lines", that isn't at all obvious to human readers.
> It is EVER so easy to end up chasing a bug that simply isn't there. Of
> course, it is hardest of all to find something that isn't there. The
> majority of my time during the tail-end of debugging DrEliza.com was spent
> chasing heidenbugs, and I suspect that with >100 times as many rules and
> the entire Internet to analyze, heidenbugs would completely swamp the
> debugging of all other problems, combined. Now, sprinkle in some wraparound
> and other such subtle problems, and you would NEVER get the thing fully
> debugged. The presence of heidenbugs tends to amplify the cost of finding
> the real bugs by an order of magnitude or so.
>
> Heidenbugs are SUCH a complication, because until you figure them out,
> they are indistinguishable from crazy things like wraparound. At some point
> you notice that, e.g., the last dozen or so bugs you have chased down were
> all heidenbugs, so you declare it "working", residual bugs and all. Ugly,
> but what other choice is there?
>
> Steve
>
--
-- Matt Mahoney, [email protected]
-------------------------------------------
AGI
Archives: https://www.listbox.com/member/archive/303/=now
RSS Feed: https://www.listbox.com/member/archive/rss/303/21088071-f452e424
Modify Your Subscription:
https://www.listbox.com/member/?member_id=21088071&id_secret=21088071-58d57657
Powered by Listbox: http://www.listbox.com
// Estimate vocabulary size of a text file from stdin.
// To use a map of strings for an exact count, compile with -DMAP
// Or to use a floating point hash table, compile with -DFP
// Default is to use an integer hash table. Either hash table will undercount
// due to hash collisions.
// Written by Matt Mahoney. Public domain.
#include <stdio.h>
#include <math.h>
#include <ctype.h>
#include <time.h>
#include <map>
#include <string>
const int BUFSIZE=4096; // input buffer size
const unsigned N=1048576; // hash table size = 2^20
#ifdef FP
double h=0; // floating point hash
const double M=atan(1); // multiplier = pi/4
#else // integer hash
unsigned h=0; // integer hash
const unsigned M=314159265u; // multiplier = a random odd 32 bit number
#endif
int main() {
  static char buf[BUFSIZE];      // input buffer
  int n;                         // input buffer actual size
  static char v[N]={0};          // hash table, 1 if word occurs
  int count=0;                   // vocabulary size
  int maxi=0;                    // largest hash index found
  std::string s;                 // input word (-DMAP only)
  std::map<std::string, int> m;  // tree of strings (-DMAP only)
  clock_t start=clock();
  while ((n=fread(buf, 1, BUFSIZE, stdin))>0) {
    for (int i=0; i<n; ++i) {
      int c=(unsigned char)buf[i];  // unsigned: isalpha() on a negative
                                    // char (bytes >127 in UTF-8) is UB
      if (isalpha(c)) {
#ifdef MAP
        s+=tolower(c);
        h=1;  // flag: inside a word
#else
        h=h*M+tolower(c)-('a'-1);  // fold letter, mapped to 1..26, into hash
#endif
      }
      else if (h!=0) {  // end of a word?
#ifdef MAP
        if (++m[s]==1)
          ++count;
        s="";
#else
#ifdef FP
        int j=int(h*8329);  // 8329 = highest prime under N*(1-M)/27
#else
        int j=h&(N-1);      // low 20 bits
#endif
        if (!v[j]) {
          v[j]=1;
          ++count;
          if (j>maxi) maxi=j;
        }
#endif
        h=0;
      }
    }
  }
  printf("%d words found in %1.2f seconds. Max index %d of %d\n",
    count, double(clock()-start)/CLOCKS_PER_SEC, maxi, N);
  return 0;
}