Matt Mahoney
Tue, 15 Aug 2006 09:28:53 -0700
To unsubscribe, change your address, or temporarily deactivate your subscription, please go to http://v2.listbox.com/member/[EMAIL PROTECTED]----- Original Message -----From: Matt MahoneySent: Tuesday, August 15, 2006 12:55 AMSubject: Re: Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prizeWhere will the knowledge to compress text come from? There are 3 possibilities.To unsubscribe, change your address, or temporarily deactivate your subscription, please go to http://v2.listbox.com/member/[EMAIL PROTECTED]
1. externally supplied, like the lexical models (dictionaries) for paq8h and WinRK.
2. learned from the input in a separate pass, like xml-wrt|ppmonstr.
3. learned online in one pass, like paq8f and slim.
These all have the same effect on compressed size. In the first case, you increase the size of the decompressor. In the second, you have to append the model you learned from the first pass to the compressed file so it is available to the decompressor. In the third case, compression is poor at the beginning. From the viewpoint of information theory, there is no difference in these three approaches. The penalty is the same.
To improve compression further, you will need to model semantics and/or syntax. No compressor currently does this. I think the reason is that it is not worthwhile unless you have hundreds of megabytes of natural language text. In fact, only the top few compressors even have lexical models. All the rest are byte oriented n-gram models.
A semantic model would know what words are related, like "star" and "moon". It would learn this by their tendency to appear together. You can build a dictionary of such knowledge from the data set itself or you can build it some other way (such as Wordnet) and include it in the decompressor. If you learn it from the input, you could do it in a separate pass (like LSA) or you could do it in one pass (maybe an equivalent neural network) so that you build the model as you compress.
To learn syntax, you can cluster words by similarity of their immediate context. These clusters correspond to part of speech. For instance, "the X is" tells you that X is a noun. You can model simple grammars as n-grams over their classifications, such as (Art Noun Verb). Again, you can use any of 3 approaches.
Learning semantics and syntax is a hard problem, but I think you can see it can be done with statistical modeling. The training data you need is in the input itself.
I don't see any point in this debate over lossless vs. lossy compression. You have to solve the language learning problem in either case to improve compression. I think it will be more productive to discuss how this can be done.
-- Matt Mahoney, [EMAIL PROTECTED]