----- Original Message -----
From: Mark Waser <[EMAIL PROTECTED]>
To: agi@v2.listbox.com
Sent: Friday, August 25, 2006 5:58:02 PM
Subject: Re: [agi] Lossy *&* lossless compression

>> However, a machine with a lossless model will still outperform one with a 
>> lossy model because the lossless model has more knowledge.

>PKZip has a lossless model.  Are you claiming that it has more knowledge? 
>More data/information *might* be arguable but certainly not knowledge -- and 
>PKZip certainly can't use any "knowledge" that you claim that it "has".

DEL has a lossy model (it simply discards the input), and nothing compresses 
smaller.  Is it smarter than PKZip?
 
Let me state one more time why a lossless model has more knowledge.  If x and 
x' have the same meaning to a lossy compressor (they compress to identical 
codes), then the lossy model knows only the combined probability p(x) + p(x').  
A lossless model also knows p(x) and p(x') separately.  You could argue that if 
x and x' are not distinguishable, then this extra knowledge is unimportant.  
But all text strings are distinguishable to humans.
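
To make the cost concrete, here is a toy sketch in Python (the probabilities 
are invented for illustration).  The lossless model's cost for x decomposes 
into the lossy model's cost for the merged event plus the cost of 
distinguishing x from x' within it:

    import math

    # Hypothetical probabilities, chosen only for illustration.
    p_x, p_x2 = 0.03, 0.01            # p(x) and p(x')
    p_merged = p_x + p_x2             # all the lossy model knows

    bits_lossy = -math.log2(p_merged)         # ~4.64 bits for either string
    bits_split = -math.log2(p_x / p_merged)   # ~0.42 bits to tell x from x'
    bits_lossless_x = -math.log2(p_x)         # ~5.06 bits

    # Lossless cost = lossy cost + cost of the distinction the lossy model lost.
    assert abs(bits_lossless_x - (bits_lossy + bits_split)) < 1e-9

That last 0.42 bits is exactly the knowledge the lossy model has thrown away.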

But let me give an example of what we have already learned from lossless 
compression tests.

1. PKZip, bzip2, ppmd, etc., model text at the character (n-gram) level.
2. WinRK and paq8h model text at the lexical level using static dictionaries.  
They compress better than (1).
3. xml-wrt|ppmonstr and paq8hp1 model text at the lexical level using 
dictionaries learned from the input.  They compress better than (2).

I think you can see the pattern.
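
Here is a toy illustration of why moving up a level helps, as an order-0 
Python sketch on a contrived string (it ignores the real cost of storing or 
learning the dictionary, which an actual compressor must pay):

    import math
    from collections import Counter

    def entropy(symbols):
        # Order-0 entropy in bits per symbol.
        counts = Counter(symbols)
        n = sum(counts.values())
        return -sum(c/n * math.log2(c/n) for c in counts.values())

    text = "the cat sat on the mat the cat ran"

    # Character-level model: one symbol per character.
    bpc_char = entropy(text)                            # ~3.1 bits/char

    # Lexical model: one symbol per word; rescale to bits per character
    # (spaces included) so the two numbers are comparable.
    words = text.split()
    bpc_word = entropy(words) * len(words) / len(text)  # ~0.6 bits/char

    print(bpc_char, bpc_word)

The word model wins by a wide margin on this string for the same reason that 
(2) beats (1): each lexical symbol soaks up the redundancy among its 
characters.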

There has been research on semantic models using distant bigrams and LSA.  
These compress cleaned text (restricted vocabulary, no punctuation) better than 
models without such capabilities, as measured by word perplexity.  Currently 
there are no general-purpose compressors that model syntax or semantics, 
probably because such models only pay off on large text corpora, not the kind 
of files people normally compress.  I think that will change if there is a 
financial incentive.
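
For reference, word perplexity is just 2 raised to the average number of bits 
per word that the model assigns, so lower perplexity means better compression 
of the word stream.  A minimal Python sketch (the per-word probabilities are 
invented):

    import math

    def perplexity(probs):
        # probs[i] = probability the model assigned to word i given its context.
        bits = sum(-math.log2(p) for p in probs)
        return 2 ** (bits / len(probs))

    # Hypothetical conditional probabilities from some language model:
    print(perplexity([0.2, 0.05, 0.1, 0.25]))   # ~7.95

A model that exploits distant bigrams or LSA assigns higher probabilities to 
the words that actually occur, and its perplexity (equivalently, its 
compressed size) drops.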

>> This does not change the fact that lossless compression is the right way 
>> to evaluate a language model.

>. . . . in *your* opinion.  I might argue that it is the *easiest* way to 
>evaluate a language model but certainly NOT the best -- and I would then 
>argue, therefore, not the "right" way either.

Also in the opinion of speech recognition researchers, who have been studying 
language models this way since the early 1990s.

>> A lossy model cannot be evaluated objectively

>Bullsh*t.  I've given you several examples of how.  You've discarded them 
>because you felt that they were "too difficult" and/or you didn't understand 
>them.

Deciding whether a lossy reconstruction is "close enough" is itself an AI 
problem, or else it requires subjective judging by humans.  Look at the 
benchmarks for video and audio codecs.  Which sounds better, AAC or Ogg?
 
-- Matt Mahoney, [EMAIL PROTECTED]