--- "J Storrs Hall, PhD" <[EMAIL PROTECTED]> wrote:
> http://science.slashdot.org/article.pl?sid=07/07/10/0055257
>
> (The Slashdot article includes a pointer to Matt Mahoney's thesis :-)
>
> Josh
I think we're a bit further than 1% away from AI. All we can really say is that we are 1% away from compressing a particular 100 MB text file to 1.3 bpc, the upper bound estimated by Shannon. The article does point to some work I did in my dissertation, which estimates the entropy of written English at around 1.0 to 1.1 bpc, though of course it depends on the text source. I never actually measured it for enwik8 or enwik9 using tests like those of Shannon or Cover and King. To do that, I would probably use something like human-assisted text compression, where the program estimates a probability distribution for the next character and a human refines that distribution before the prediction is revealed. In any case, we have already passed 1.3 bpc on enwik9 (1 GB at 1.056 bpc):

http://cs.fit.edu/~mmahoney/compression/text.html

I don't know what the entropy would be using human-assisted compression, perhaps 0.8 to 0.9 bpc. About 30% of the data is nontext content such as XML and XHTML markup, table formatting, articles generated from census tables, etc., which tends to be more compressible. Also, compressing 1 GB is more resource intensive than the 100 MB used for the Hutter prize. I think a 5-10% improvement might be possible just by running existing algorithms on a computer with 16-32 GB of memory and a few days of CPU time.

I had some extensive discussions with Marcus Hutter in setting up the contest last year. I wanted to use 1 GB of text because it is roughly the amount of language the average human is exposed to by adulthood. I also wanted no limits on CPU or memory, since I believe current computers are inadequate for AI. The decompressor size has to be included, but I was very lenient about what constitutes a decompressor: either source or an executable in any language, packaged into a zip file. Hutter wanted stricter rules, which I can understand because it's his prize money. In any case, that is why there are two separate but similar benchmarks.
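As a rough sketch of the arithmetic behind these numbers (the helper below is illustrative, not part of any benchmark code): under ideal arithmetic coding, a model that assigns probability p to each character that actually occurs codes the text at an average of -log2(p) bits per character, which is how a bpc figure translates into a compressed file size.

```python
import math

def bits_per_character(probs):
    """Ideal code length, in bits per character, when a model assigns
    probability probs[i] to the i-th character that actually occurred.
    A real arithmetic coder approaches this rate closely."""
    return sum(-math.log2(p) for p in probs) / len(probs)

# A model that always predicts the observed character with
# probability 0.5 codes the text at exactly 1.0 bpc.
print(bits_per_character([0.5] * 4))   # 1.0

# At 1.056 bpc, the 10^9-byte enwik9 compresses to about 132 MB.
print(round(1.056 / 8 * 1e9))          # 132000000
```

In a human-assisted entropy test, probs would be the refined probabilities the human-plus-program model assigned to the characters that actually appeared, and the resulting bpc would be the entropy estimate.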
My main objection is the use of a 100 MB data set, which corresponds roughly to the language model of a 2-3 year old child.

-- Matt Mahoney, [EMAIL PROTECTED]

-----
This list is sponsored by AGIRI: http://www.agiri.org/email
