Matt, ***THANKS*** for the test. Continuing...

On Tue, Mar 19, 2013 at 1:39 PM, Matt Mahoney <[email protected]> wrote:
> I did a small test comparing your floating point hash function with a similar integer hash function like those I typically use in data compression programs to compute word-order contexts. The program parses input text into words, converts to lower case, and computes a hash. In order to test uniformity, I did no collision detection. Rather, it sets a hash table entry (1 byte) to 1 at the hash index. When finished, it prints a count of the number of hash entries set. This will be less than the true vocabulary size due to collisions. Higher numbers indicate a more uniform distribution.
>
> I used a hash table size of N = 1 MiB (2^20). For the integer hash, I multiply by a randomly chosen odd number (M = 314159265) and add the next byte of the string converted to a number in the range 1..26 corresponding to a..z (not case sensitive). For the floating point hash I used M = pi/4. To compute the hash index, I take the low 20 bits for the integer hash. For the floating point hash, I multiply by a large prime number and truncate to an integer as suggested in your patent application. To stay in range, this number cannot be larger than N*(1-M)/27 = 8334. The largest such prime is 8329. (It is possible to convert the letter to 0..25 and use N*(1-M)/26, but this gives a worse distribution for both the integer and floating point hash.) As a control, I also read the lowercase words into a map<string, int> (actually a red-black tree) as strings to get an exact vocabulary count. I tested on enwik8 (100 MB text) and enwik9 (1 GB text) from http://mattmahoney.net/dc/text.html on a 2.0 GHz T3200 under 32-bit Vista, compiled with MinGW g++ 4.7.0 -O3.
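[Editorial sketch: the two hash functions described above can be reconstructed as follows. This is an illustration based on the prose, not the code Matt attached; the constants (314159265, pi/4, 8329, and the 2^20 table) are taken from the text, and input is assumed to be already lower-cased to a..z.]

```cpp
#include <cstdint>
#include <string>

const uint32_t N = 1u << 20;   // hash table size (2^20 entries)

// Integer hash: multiply by a randomly chosen odd constant and add
// the next letter mapped to 1..26; the index is the low 20 bits.
uint32_t int_hash(const std::string& word) {
    uint32_t h = 0;
    for (char c : word)
        h = h * 314159265u + uint32_t(c - 'a' + 1);  // unsigned wraparound is well defined
    return h & (N - 1);
}

// Floating point hash: the same recurrence with M = pi/4 < 1, so h
// stays below 26/(1-M) ~ 121.2; multiplying by the prime 8329 and
// truncating then keeps the index under N.
uint32_t fp_hash(const std::string& word) {
    double h = 0.0;
    for (char c : word)
        h = h * 0.78539816339744831 + double(c - 'a' + 1);
    return uint32_t(h * 8329.0);
}
```

[Note that on this scheme the integer index can reach all 2^20 slots, while the floating point index can never exceed roughly 8329 * 121.2 ~ 1,009,000, which may contribute to its worse spread.]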
> Vocabulary counts and run times are as follows:
>
> enwik8
> map:   288,920, 21.75 sec (true count)
> int:   251,936,  2.37 sec
> float: 201,504,  2.68 sec
>
> enwik9
> map:   1,171,106, 258.69 sec
> int:     703,954,  26.72 sec
> float:   389,345,  29.87 sec
>
> The test results show that converting words to ordinals is indeed an order of magnitude faster than working with strings. Of course this is without collision detection. To detect collisions, you still have to compare the original strings.

*NO*!!! Just compare the 64-bit hash, either as an integer or as a double precision floating point value. Even in floating point (integer is better here), the chance of having ANY collisions, where different words have exactly the same hash, is about one in a million. Even if there is a collision, it will only make synonymous two unusual and unrelated words. It isn't perfect, but it IS definitely good enough - the criterion I used for lots of this.

> The poor distribution of the floating point hash could probably be fixed by multiplying by a larger number (it doesn't need to be prime) and using only the low bits.

I don't understand. What is the problem with multiplying by a really large number in floating point?

> However, it is still a little slower. On modern processors, floating point arithmetic is as fast as integer arithmetic, but there are the extra steps of converting characters to floating point and converting hash indexes back to integers.

Right.

> The integer hash function uses arithmetic overflow. Unsigned (but not signed) overflow has well defined behavior in C and C++. In fact, it is routinely used in FIPS certified cryptographic code.
>
> Compile the attached code with -DFP to use the floating point hash, or -DMAP to use the map. The default is to use the integer hash. To run, redirect input to a text file.
> It will print the number of hash table entries (vocabulary size detected), run time, maximum hash index (should be slightly less than 1 MB), and hash table size (1 MB).
>
> Also, I didn't see the patent claims in your application. Did you forget to attach them?

No. This is a complex legal issue. At considerable risk of not getting this perfectly right: I/we (including my attorney, who is working for a piece of the action) have a not-particularly-unusual strategy. Claims can be amended later without incurring additional cost. So I am putting this out there for everyone, most especially you, to kick the tires. Then, when the patent office first "wakes up" and sends something - probably lots of objections - we will amend the claims. You have clearly shown that floating point hashing has no reasonable place in any independent claims, so when the claims finally see the light of day, it will be buried down in some dependent claims. To cover our butts (since the entire patent system changed 4 days ago, and the old rules are MUCH better than the new rules), there are LOTS of claims to amend. As filed, it has 5 independent claims and 30 total claims. At some point we will probably be forced to have the US patent office publish the claims then in their hands, as there would be no way to enforce or negotiate a sale without having done so. Of course, if you are willing to sign an NDA, I would be more than glad to put you in the loop, send you the present claims, etc. If you are considering filing your own patents, this might not be a good idea.

I was especially hoping you would comment on some particular subjects:

1. My three-pronged approach of producing unsolicited responses, customizing advertisements (which will probably be called something else when they accurately target a posting), and having a web site portal into the system, so that each manner of "entry" could be used where it worked best, e.g. blogs would probably use the advertising model.
As I understand your approach - to integrate the AI into the affairs of civilization, akin to the Colossus trilogy - it would be a harder sell at the beginning and require greater investment, but would probably work better in the end. Unfortunately, we live in a world where it is difficult to get things started, so I chose the easier/poorer path. So don't burn your proposal quite yet - it may be just what the world needs in another decade or so.

2. My business model for promoting the technology - granting generous monetary credit toward royalties in return for participation in developing the technology. Done right, the likes of Google, Yahoo, and Facebook alone would support the entire effort.

3. Anything else you see that is right, wrong, or needs improvement?

Other than me, you are the only one who has invested significant effort into proposing a specific intelligent Internet. You should get over your prejudices (e.g. against DrEliza.com), understand my proposal (which is only distantly related to DrEliza.com), and comment on it in ways that can guide its continuing improvement, so it succeeds.

***THANKS*** again for the testing.

Steve

==========================

> On Tue, Mar 19, 2013 at 2:20 PM, Steve Richfield <[email protected]> wrote:
>
>> Matt,
>>
>> On Mon, Mar 18, 2013 at 6:03 PM, Matt Mahoney <[email protected]> wrote:
>>
>>> On Mon, Mar 18, 2013 at 4:29 PM, Steve Richfield
>>>
>>> As to your hash function, I don't see why this should be any faster than integer arithmetic.
>>
>> ...
>>
>>> And no, you do not get overflow errors with integer arithmetic. The result is just truncated, which for many hash functions is actually what you want.
>>
>> After some research into the above, I will more precisely state the situation with integer hashing and await any clarification from you:
>>
>> Intel hardware does NOT throw exceptions on integer overflow, but provides for optional outboard tests to check for these overflows.
>> Other "big iron", like IBM mainframes, DOES throw exceptions on integer overflow, because it handles money, which is MUCH more valuable than bits.
>>
>> Some languages, like Visual Basic, check for integer overflow by default, though this check can be disabled for ALL operations. Some languages, like Java, do not check for integer overflow at all. Languages like C define overflow behavior for unsigned integers but leave signed integer overflow undefined. People writing in assembly can do anything they want, but must then pay the overhead of getting to and from their assembly-coded routines, e.g. via DLL linkage.
>>
>> In the bad old days of individually compiling routines and linking them together, it was a trivial matter to specially compile one routine, or to write a particular routine in a different language. Now you must do something special, like putting a routine into a DLL, to get special treatment for a particular routine.
>>
>> Integer wraparound is one of the primary sources of the myriad updates we all get from Microsoft to plug security holes. Add some assholes who try to crash things (I get these all the time being fed into DrEliza.com) and life can get pretty difficult.
>>
>> I wouldn't think of attempting to write and maintain AI code in a language that isn't HIGHLY checked, especially for all matters pertaining to subscripting. Unfortunately, having integer hashing code ANYWHERE in the program means that, regardless of the platform chosen, subscript computations must go unchecked EVERYWHERE in the program.
>>
>> Java covers this apparent weakness by allowing integer computations to go unchecked, but then checking all subscripts prior to use, to make sure the code isn't stepping on anything besides the array being addressed.
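[Editorial illustration of the distinction drawn above, in C++: unsigned arithmetic wraps with fully defined behavior, signed overflow is undefined, and GCC/Clang already provide a per-operation overflow check as a compiler extension - essentially the per-statement checking the choices below ask for. The function names here are my own.]

```cpp
#include <cstdint>

// Unsigned overflow is well defined in C and C++: the result wraps
// modulo 2^32. A hash recurrence can rely on this. Signed overflow,
// by contrast, is undefined behavior.
uint32_t wrap_add(uint32_t a, uint32_t b) {
    return a + b;   // 0xFFFFFFFF + 1 wraps to 0 - no trap, no UB
}

// GCC/Clang builtin (assumes one of those compilers): reports overflow
// per operation, so a hash loop can wrap freely while subscript
// arithmetic elsewhere stays checked.
bool mul_overflows(uint32_t a, uint32_t b) {
    uint32_t out;
    return __builtin_mul_overflow(a, b, &out);
}
```

[For example, mul_overflows(314159265u, 314159265u) returns true, since the full product needs about 57 bits.]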
>> This avoids clobbering memory, but does NOT guarantee that there was no wraparound in the computations that arrived at a valid though possibly incorrect subscript.
>>
>> So this leaves the following choices:
>> 1. A developer chooses an Intel processor and a language that doesn't check for integer overflow, or turns overflow checking off in a language like Visual Basic where it can be disabled. In the process they leave themselves wide open, with all of the text on the Internet hitting the most complex AI code ever written, to an integer wrapping around SOMEWHERE in the code and causing problems; or
>> 2. They go ahead and use a slower integer method that survives overflow checking, with an eye to later replacing it with faster code once the program works well enough to safely disable such checking. Of course, floating point methods would eliminate this step, so why bother?
>> 3. They use floating point methods that work in the presence of FULL error checking.
>> 4. Maybe future compiler writers will provide a way to disable error checking for specific statements, which would make it possible to use integer hashing without sacrificing any error checking.
>>
>> So, as I see it (and commented in my previous posting), you are technically correct, but integer hashing isn't worth its non-speed costs, like the reduced reliability of all of the OTHER code.
>>
>> I have written thousands of pages of ugly AI code, and I wouldn't dream of turning off ANY available error checking. I prefer Visual Basic only for its superb error checking, which can be selectively disabled once the program has been fully debugged - even though other languages are MUCH more powerful.
>>
>> Further, there is a special problem in debugging NLP that makes chasing subtle problems a MAJOR challenge to be avoided at all costs - "heidenbugs" - where the program is working correctly, e.g.
>> correctly picking up on something "between the lines" that isn't at all obvious to human readers. It is EVER so easy to end up chasing a bug that simply isn't there, and of course it is hardest of all to find something that isn't there. The majority of my time during the tail end of debugging DrEliza.com was spent chasing heidenbugs, and I suspect that with >100 times as many rules and the entire Internet to analyze, heidenbugs would completely swamp the debugging of all other problems combined. Now sprinkle in some wraparound and other such subtle problems, and you would NEVER get the thing fully debugged. The presence of heidenbugs tends to amplify the cost of finding the real bugs by an order of magnitude or so.
>>
>> Heidenbugs are SUCH a complication because, until you figure them out, they are indistinguishable from crazy things like wraparound. At some point you notice that, e.g., the last dozen or so bugs you have chased down were all heidenbugs, so you declare it "working", residual bugs and all. Ugly, but what other choice is there?
>>
>> Steve
>>
>> AGI | Archives: https://www.listbox.com/member/archive/303/=now | Modify Your Subscription: http://www.listbox.com
>
> --
> -- Matt Mahoney, [email protected]

--
Full employment can be had with the stroke of a pen. Simply institute a six-hour workday. That will easily create enough new jobs to bring back full employment.
