--- Mark Waser <[EMAIL PROTECTED]> wrote:
> > Can you argue
> > that the representation is at least half of the information?
>
> Yes, I can. Take any case that involves a unit of measurement. The
> statements "John is x inches tall", "John is y centimeters tall", "John is
> a.b feet tall", "John is c.d meters tall", "John is 1/e miles tall" can be
> reproduced nearly ad infinitum.

But people don't normally speak that way. My guess is that 80% to 90% of the information content of normal written English is meaning, and 10% to 20% is choice of representation. Take a random string of text of length n characters and ask how many ways it can be rewritten without changing the meaning. My guess is around 2^0.1n to 2^0.2n.

For a parser plus knowledge base plus text generator like the one you described (assuming one could be built that understands the majority of English sentences), a lossy test would be appropriate. Encode and decode a random sentence and ask a judge whether it still means the same thing. You could evaluate the system by the size of the compressed database. If your system can recognize the same fact phrased in two different ways, then the second input would not increase the database size at all.

Of course this test is subjective. One judge may say that "John is 6 feet tall" and "John is 182 cm tall" have the same meaning. Another may disagree. Are these really the same?

I don't believe that encoding the representation is hard, at least not nearly as hard as the very difficult problem of recognizing when two different sentences have the same meaning. If the representation is not compressible, then encoding it optimally is a trivial problem.

-- Matt Mahoney, [EMAIL PROTECTED]
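
As a rough illustration of the database-size test described above, here is a minimal Python sketch. Everything in it is hypothetical: the names parse_height and HeightDatabase, the restriction to "X is N <unit> tall" sentences, and the tolerance_cm parameter that stands in for the human judge are assumptions made for this example, not part of any system discussed in this thread.

```python
import re

# Conversion factors to centimeters (the inch is defined as exactly 2.54 cm).
CM_PER_UNIT = {"cm": 1.0, "inches": 2.54, "feet": 30.48, "meters": 100.0}

# Hypothetical toy grammar: only sentences of the form "<name> is <number> <unit> tall".
PATTERN = re.compile(r"^(\w+) is (\d+(?:\.\d+)?) (cm|inches|feet|meters) tall$")


def parse_height(sentence):
    """Return (name, height_in_cm) for a supported sentence."""
    match = PATTERN.match(sentence)
    if match is None:
        raise ValueError(f"unsupported sentence: {sentence!r}")
    name, value, unit = match.groups()
    return name, float(value) * CM_PER_UNIT[unit]


class HeightDatabase:
    """Stores heights in one canonical unit, so rephrasings add nothing."""

    def __init__(self, tolerance_cm):
        # Assumption: two heights "mean the same" if they differ by at most
        # this many centimeters. This plays the role of the judge.
        self.tolerance_cm = tolerance_cm
        self.facts = {}  # name -> list of distinct heights in cm

    def add(self, sentence):
        """Store the fact; return True only if the database grew."""
        name, height_cm = parse_height(sentence)
        known = self.facts.setdefault(name, [])
        if any(abs(height_cm - h) <= self.tolerance_cm for h in known):
            return False  # same fact, different representation
        known.append(height_cm)
        return True

    def size(self):
        """Number of distinct facts stored."""
        return sum(len(heights) for heights in self.facts.values())


lenient = HeightDatabase(tolerance_cm=1.0)
lenient.add("John is 6 feet tall")
lenient.add("John is 182 cm tall")

strict = HeightDatabase(tolerance_cm=0.5)
strict.add("John is 6 feet tall")
strict.add("John is 182 cm tall")
```

With the 1 cm tolerance, the second sentence leaves the database at one fact, because 6 feet is 182.88 cm and that is within 1 cm of 182. With the 0.5 cm tolerance, the same sentence is stored as a second fact. The two settings correspond to the two judges who disagree about whether the sentences mean the same thing.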
