--- "J Storrs Hall, PhD" <[EMAIL PROTECTED]> wrote:

> http://science.slashdot.org/article.pl?sid=07/07/10/0055257
> 
> (The Slashdot article includes a pointer to Matt Mahoney's thesis :-)
> 
> Josh

I think we're a bit further than 1% away from AI.  All we can really say is
that we are 1% from compressing a particular 100 MB text file to 1.3 bpc, the
upper bound estimated by Shannon.  The article does point to some work I did
in my dissertation, which estimates the entropy of written English to be around
1.0 to 1.1 bpc, though of course it depends on the text source.  I never actually
measured it for enwik8 or enwik9 using tests like those of Shannon or Cover
and King.  To do that I would probably use something like human-assisted text
compression, where the program would estimate a probability distribution for
the next character and a human would refine the distribution before the
prediction was revealed.
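To make the idea concrete, here is a minimal sketch of how such an experiment
would be scored.  The `predict` and `refine` callables are hypothetical
stand-ins (not part of any published tool): the model proposes a distribution
over the next character, the human optionally adjusts it, and the code length
charged is the ideal arithmetic-coding cost of the character that actually
occurs.

```python
import math

def cross_entropy_bpc(text, predict, refine=None):
    """Estimate entropy of `text` in bits per character.

    `predict(context)` returns a dict mapping candidate next characters
    to probabilities; `refine` (the human step) may adjust that
    distribution before the true character is revealed.
    """
    total_bits = 0.0
    for i, ch in enumerate(text):
        dist = predict(text[:i])
        if refine is not None:
            dist = refine(dist)  # the human sharpens the model's guess
        # Ideal code length for the observed character; characters the
        # model assigned no mass get a small floor probability.
        total_bits += -math.log2(dist.get(ch, 1e-6))
    return total_bits / len(text)

# Sanity check: a uniform predictor over two symbols costs exactly 1 bpc.
uniform = lambda context: {'a': 0.5, 'b': 0.5}
print(cross_entropy_bpc("abab", uniform))  # 1.0
```

If the human's refinements lower the average code length below the model's
own, the gap is a direct measure of how far the model is from human-level
prediction.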

In any case, we have already passed 1.3 bpc on enwik9 (1 GB at 1.056 bpc; see
http://cs.fit.edu/~mmahoney/compression/text.html).  I don't know what the
entropy would be using human assisted compression, perhaps 0.8 to 0.9 bpc. 
About 30% of the content is nontext: XML and XHTML markup, table formatting,
articles generated from census tables, etc.  This tends to be more
compressible.  Also, compressing 1 GB is more resource-intensive than the 100
MB used for the Hutter prize.  I think a 5-10% improvement might be possible
just by using existing algorithms on a computer with 16-32 GB of memory and a
few days of CPU time.
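For anyone checking the arithmetic: the bpc figure is just the compressed size
in bits divided by the number of characters in the original (one byte per
character in enwik9).  A trivial sketch:

```python
def bits_per_character(compressed_bytes, original_chars):
    # bpc = total compressed bits / original character count
    return compressed_bytes * 8 / original_chars

# enwik9 is 10**9 bytes, so 1.056 bpc corresponds to a compressed
# size of about 132 MB (1.056e9 bits / 8 bits per byte):
print(bits_per_character(132_000_000, 10**9))  # 1.056
```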

I had some extensive discussions with Marcus Hutter in setting up the contest
last year.  I wanted to use 1 GB of text because it is the amount of language
that the average human is exposed to by adulthood.  I also wanted no limits on
CPU or memory, since I believe that current computers are inadequate for AI. 
The decompressor size has to be included, but I was very lenient about what
constitutes a decompressor, either source or executable in any language
packaged into a zip file.  Hutter wanted stricter rules, which I can
understand because it's his prize money.  In any case, that is why there are
two separate but similar benchmarks.  My main objection is the use of a 100 MB
data set, which corresponds to the language model of a 2-3-year-old child.



-- Matt Mahoney, [EMAIL PROTECTED]

-----
This list is sponsored by AGIRI: http://www.agiri.org/email