On Fri, Mar 22, 2013 at 6:30 PM, Steve Richfield
<[email protected]> wrote:
> I believe that my approach will be fast enough to keep up with the Internet, 
> and I haven't seen any other approach that promises such blinding speed. In 
> theory, all I need do is get the word out, and wait for folks at Google, 
> Yahoo, and Facebook to discover it, which is my present plan.

Blinding speed remains to be seen.

You are probably aware that most language models already use the
technique of converting words to ordinals. When speech recognition was
a hot research topic in the 1990s, a common language modeling
technique was to convert the most common 20K words to numbers and
construct various models to compute word probabilities using e.g.
trigram and sparse bigram models. These probabilities are combined
with the distribution from the acoustic model to select the most
likely next word. The language models would also be evaluated
independently by measuring perplexity, which is really the same thing
as compression ratio. Better models compress smaller, and this
correlates strongly with lower word error rates.
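To make that concrete, here is a toy sketch of the word-to-ordinal step and a trigram model scored by perplexity. The add-one smoothing and vocabulary cutoff are arbitrary illustrative choices, not what any particular 1990s recognizer used:

```python
import math
from collections import Counter

def build_vocab(corpus_words, max_size=20000):
    # Map the most common words to small integers (ordinals); anything
    # rarer falls back to a shared <unk> symbol.
    counts = Counter(corpus_words)
    vocab = {"<unk>": 0}
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def to_ordinals(words, vocab):
    return [vocab.get(w, 0) for w in words]

def trigram_perplexity(train_ids, test_ids, vocab_size):
    # Add-one-smoothed trigram model. Perplexity is 2**(bits per word),
    # i.e. the same quantity a compressor minimizes: a better model
    # compresses smaller and has lower perplexity.
    tri = Counter(zip(train_ids, train_ids[1:], train_ids[2:]))
    bi = Counter(zip(train_ids, train_ids[1:]))
    bits, n = 0.0, 0
    for a, b, c in zip(test_ids, test_ids[1:], test_ids[2:]):
        p = (tri[(a, b, c)] + 1) / (bi[(a, b)] + vocab_size)
        bits += -math.log2(p)
        n += 1
    return 2 ** (bits / n)
```

A real system would interpolate this with sparse bigram and unigram estimates and combine the result with the acoustic model's distribution; this only shows the ordinal/perplexity bookkeeping.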

Converting words to ordinals is also used by the compressors ranked 1,
2, 4, 7, and 9 at http://mattmahoney.net/dc/text.html
The technique is to read a dictionary (a list of words), construct a
hash table index in memory, then convert the input text to a sequence
of ordinals (1, 2, or 3 byte numbers depending on word frequency),
plus symbols for punctuation, letters for spelling rare unmatched
words, and symbols to denote when the word is capitalized or all upper
case. The parsers use simple ad hoc rules similar to the ones
described in your patent. Then the sequence of ordinals is compressed,
usually one bit at a time using a mix of hundreds of context models.
The top ranked program (durilca'kingsize) uses PPM, or 1 byte at a
time prediction, and wins only because it uses 13 GB of memory. The
second place program (paq8hp12any) is the winner of the Hutter prize,
which limits memory to 1 GB.
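Roughly, the tokenization step looks like the sketch below. The byte-range boundaries and escape codes are invented for illustration; none of the actual compressors use exactly this layout:

```python
# Escape symbols occupy the first few ordinals (hypothetical layout).
CAP, UPPER, LITERAL = 0, 1, 2   # capitalized / all-caps / raw character

def encode_ordinal(n):
    """Variable-length code: 1 byte for the 128 most common words,
    2 bytes for the next 16384, 3 bytes beyond (invented scheme)."""
    if n < 128:
        return bytes([n])
    n -= 128
    if n < 16384:
        return bytes([0x80 | (n >> 8), n & 0xFF])
    n -= 16384
    return bytes([0xC0 | (n >> 16), (n >> 8) & 0xFF, n & 0xFF])

def tokenize(words, index):
    """index maps lowercase dictionary words to ordinals >= 3,
    most frequent words first so they get the shortest codes."""
    out = bytearray()
    for word in words:
        lower = word.lower()
        if lower in index:
            # Emit a case symbol before the ordinal when needed.
            if word.istitle():
                out += encode_ordinal(CAP)
            elif word.isupper() and len(word) > 1:
                out += encode_ordinal(UPPER)
            out += encode_ordinal(index[lower])
        else:
            # Spell out rare unmatched words letter by letter.
            for ch in word:
                out += encode_ordinal(LITERAL)
                out.append(ord(ch) & 0x7F)
    return bytes(out)
```

The ordinal stream produced this way is what then gets fed to the context-mixing or PPM back end.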

In all cases, the dictionary is derived from the test data itself.
(The evaluation includes the decompression program and compressed
dictionary size, so it is not cheating to do this.) For the
7th-ranked xwrt, this is done dynamically. In the other cases, the
dictionary was prepared in advance from the test data and organized to
improve both its own compressibility and that of the tokenized text.
The first method involves grouping words with similar spelling,
such as by sorting reversed words to group suffixes. The second method
involves grouping words that are related semantically and
grammatically, like "mother" with "father" and "monday" with
"tuesday". For paq8hp12any and lpaq9m (both use the same
dictionaries), a lot of the organization was done manually with the
help of some utilities. For durilca'kingsize, the author didn't go
into details but I believe he used a program that clustered the words
in context space. Either way, this grouping allows for contexts that
drop the low order bits of the ordinals, effectively making the group
itself a context.
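Sketches of both orderings, assuming a hypothetical grouping that drops the low 6 bits of each ordinal (the actual group size varies by program):

```python
def suffix_sorted(words):
    # Sort by reversed spelling so words sharing suffixes land next to
    # each other, which makes the dictionary file itself compress better.
    return sorted(words, key=lambda w: w[::-1])

def group_context(ordinal, group_bits=6):
    # Drop the low-order bits of the ordinal. If semantically or
    # grammatically related words were assigned nearby ordinals, the
    # surviving high bits identify the group, and the group acts as a
    # single context for the model.
    return ordinal >> group_bits
```

With a suffix-sorted list, "father"/"mother" and "running"/"singing" end up adjacent; with semantic grouping, "monday" and "tuesday" would share the same `group_context` value.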

Anyway, I would not be surprised if Google, Bing, Facebook, etc. are
already using similar techniques in their language models. You might
actually want to build something before making bold claims. The part
you glossed over - building a set of rules describing the language -
might not be as easy as you think. A couple of benchmarks: Watson was
a 30 person-year effort, and most of the language rules were learned
from 4 TB of text rather than hand coded. It runs on a few thousand
CPU cores. Cyc has invested hundreds of person-years since 1984 and
they have absolutely no idea how much more work they need to do.

--
-- Matt Mahoney, [email protected]


-------------------------------------------
AGI
Archives: https://www.listbox.com/member/archive/303/=now