--- Mark Waser <[EMAIL PROTECTED]> wrote:
> > Can you argue
> > that the representation is at least half of the information?
> 
> Yes, I can.  Take any case that involves a unit of measurement.  The 
> statements "John is x inches tall", "John is y centimeters tall", "John is 
> a.b feet tall", "John is c.d meters tall", "John is 1/e miles tall" can be 
> reproduced nearly ad infinitum.

But people don't normally speak that way.  My guess is that 80% to 90% of the
information content of normal written English is meaningful, and 10% to 20% is
choice of representation.  Take a random string of text of length n characters
and ask how many ways it can be rewritten without changing the meaning.  My
guess is around 2^(0.1n) to 2^(0.2n).
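
To make that estimate concrete, here is a minimal sketch of the arithmetic.
It reads the exponents as 0.1 to 0.2 bits of representational choice per
character.  The function and parameter names are made up for illustration;
only the two rates come from the guess above.

```python
def rewrite_count_bounds(n_chars, low_bits_per_char=0.1, high_bits_per_char=0.2):
    """Return (low, high) guesses for the number of meaning-preserving rewrites.

    Each bound is 2 raised to (bits of representational choice per character
    times the number of characters).
    """
    low = 2 ** (low_bits_per_char * n_chars)
    high = 2 ** (high_bits_per_char * n_chars)
    return low, high


# A 100-character string: 2^10 to 2^20 equivalent rewrites.
low, high = rewrite_count_bounds(100)
print(f"about {low:.0f} to {high:.0f} rewrites")
```

So a 100-character sentence would have roughly a thousand to a million
equivalent phrasings under this guess.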

For a parser plus knowledge base plus text generator like the one you
described (assuming one could be built that understands the majority of
English sentences), a lossy test would be appropriate.  Encode and decode a
random sentence and ask a judge if it means the same thing.  You could
evaluate the system by the size of the compressed database.  If your system
can recognize the same fact phrased in two different ways, then the second
input would not increase the database size at all.
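
As a toy illustration of that last point (not a claim about how a real
system would work), the sketch below canonicalizes height statements to
centimeters before storing them.  The pattern, the unit table, and the
ToyKnowledgeBase class are all hypothetical, and it handles only sentences
of the form "NAME is NUMBER UNIT tall".

```python
import re

# Hypothetical unit table: centimeters per unit.
CM_PER_UNIT = {
    "inches": 2.54,
    "feet": 30.48,
    "cm": 1.0,
    "centimeters": 1.0,
    "meters": 100.0,
}

# Only matches sentences like "John is 6 feet tall".
PATTERN = re.compile(r"^(\w+) is ([\d.]+) (\w+) tall$")


def canonical_fact(sentence):
    """Map a height sentence to (name, 'height_cm', whole centimeters), or None."""
    match = PATTERN.match(sentence.strip())
    if match is None or match.group(3) not in CM_PER_UNIT:
        return None
    name, number, unit = match.groups()
    return (name, "height_cm", round(float(number) * CM_PER_UNIT[unit]))


class ToyKnowledgeBase:
    """Stores canonical facts in a set, so a rephrased fact adds nothing."""

    def __init__(self):
        self.facts = set()

    def add(self, sentence):
        """Add a sentence if it parses; return the number of stored facts."""
        fact = canonical_fact(sentence)
        if fact is not None:
            self.facts.add(fact)
        return len(self.facts)


kb = ToyKnowledgeBase()
size_after_first = kb.add("John is 6 feet tall")
size_after_second = kb.add("John is 72 inches tall")  # same fact, new phrasing
print(size_after_first, size_after_second)
```

Both sentences reduce to the same stored tuple, so the database size is
unchanged by the second input.  Rounding to whole centimeters is itself a
lossy choice that a judge might or might not accept.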

Of course this test is subjective.  One judge may say that "John is 6 feet
tall" and "John is 182 cm tall" have the same meaning.  Another may disagree. 
Are these really the same?
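
One way to see why judges could disagree: 6 feet is 182.88 cm, so whether
it "equals" 182 cm depends on the tolerance the judge has in mind.  A small
sketch, with a made-up function name and made-up tolerance values:

```python
CM_PER_FOOT = 30.48  # exact by definition of the international foot


def same_height(feet, centimeters, tolerance_cm):
    """True if the two measurements differ by at most tolerance_cm."""
    return abs(feet * CM_PER_FOOT - centimeters) <= tolerance_cm


# The two statements differ by about 0.88 cm.
strict_judge = same_height(6, 182, tolerance_cm=0.5)
lenient_judge = same_height(6, 182, tolerance_cm=1.0)
print(strict_judge, lenient_judge)
```

The strict judge rejects the pair and the lenient judge accepts it, which
is the subjectivity in question.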

I don't believe that encoding the representation is hard, at least not nearly
as hard as the very difficult problem of recognizing when two different
sentences have the same meaning.  If the representation is not compressible,
then encoding it optimally is a trivial problem.
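
To sketch why the incompressible case is trivial: if the possible phrasings
of a fact are equally likely, nothing beats writing the index of the chosen
phrasing as a fixed-width binary number.  The function names and the list
of alternatives below are hypothetical, and the sketch assumes encoder and
decoder share the same ordered list.

```python
def encode_choice(alternatives, chosen):
    """Return the index of `chosen` as a fixed-width bit string."""
    # Bits needed to distinguish len(alternatives) items.  A single
    # alternative strictly needs 0 bits; 1 is used here to keep the
    # string non-empty.
    width = max(1, (len(alternatives) - 1).bit_length())
    return format(alternatives.index(chosen), f"0{width}b")


def decode_choice(alternatives, bits):
    """Recover the chosen phrasing from its bit string."""
    return alternatives[int(bits, 2)]


phrasings = [
    "John is 6 feet tall",
    "John is 72 inches tall",
    "John is 182.88 cm tall",
    "John is 1.8288 meters tall",
]
bits = encode_choice(phrasings, "John is 182.88 cm tall")
print(bits, "->", decode_choice(phrasings, bits))
```

Four alternatives cost two bits here.  The hard part is building the list
of equivalent phrasings in the first place, not coding the choice.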


-- Matt Mahoney, [EMAIL PROTECTED]

-----
This list is sponsored by AGIRI: http://www.agiri.org/email
