Matt Crawford wrote:
>> Plus a string of log(N) bits telling you how many times to apply the
>> decompression function!
>> Uh-oh, now goes over the judge's head ...
Hadmut Danisch wrote:
> The problem is that if you ask for a string of log(N) bits, then
> someone else could take this as a proof that this actually works,
> because a string of log(N) bits is obviously shorter than the message
> of N bits, thus the compression scheme is working.  Hooray!
That misses the point of the construction that was the subject of Matt's remark. The point was (and remains) that the compressed output (whether it be 1 bit, or 1+log(N) bits, or 1+log^*(N) bits) is ridiculous because it is manifestly undecodeable. It is far, far too good to be true.
The only question is whether the construction is simple enough for the judge to understand.
There is no question whether the construction is a valid _reductio ad absurdum_.
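To make the absurdity concrete, here is a toy sketch (my own, not from the thread) of the iterated construction.  The stand-in "compressor" simply drops the last bit; it always shrinks its input, which is exactly what a compress-anything claim promises, and iterating it reduces any N-bit message to one bit plus an iteration count that fits in log(N)-ish bits:

```python
def compress(bits: str) -> str:
    """Toy stand-in for a claimed always-shrinking compressor."""
    return bits[:-1]

def iterate(bits: str):
    """Apply the compressor until one bit remains.
    Return (final bit, iteration count)."""
    count = 0
    while len(bits) > 1:
        bits = compress(bits)
        count += 1
    return bits, count

# Two *different* 8-bit messages collapse to the identical
# "compressed" form, so no decompressor can possibly tell them
# apart -- manifestly undecodeable, just as claimed above.
print(iterate("10110010"))  # ('1', 7)
print(iterate("10110011"))  # ('1', 7)
```

Any always-shrinking function suffers the same fate; the drop-a-bit compressor just makes the information loss impossible to miss.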
While we are on the subject, I recommend the clean and elegant argument submitted by Victor Duchovni (08/31/2004 03:50 PM) and also in more detail by Matt Crawford (08/31/2004 06:04 PM). It uses mathematical induction rather than proof-by-contradiction. It is a less showy argument, but probably it is understandable by a wider audience. It proves a less-spectacular point, but it is quite sufficient to show that the we-can-compress-anything claim is false. (Although with either approach, at least *some* mathematical sophistication is required. Neither PbC nor MI will give you any traction with ten-year-olds.)
So it appears we have many different ways of approaching things:
 1) The pigeon-hole argument.  (This disproves the claim that all
    N-bit strings are compressible ... even if the claim is restricted
    to a particular fixed N.)
 2) The mathematical induction argument.  (Requires the claimed
    algorithm to be non-expansive for a range of N.)
 3) The proof-by-contradiction.  (Requires the claimed algorithm
    to be compressive -- not just non-expansive -- for a range of N.)
 4) Readily-demonstrable failure of *any* particular claimed example,
    including Lempel-Ziv and all the rest.
 *) Others?
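Items 1 and 4 are both easy to demonstrate by machine.  A sketch (my own; the toy compressor is an invented example, and any deterministic shrinking map would collide the same way):

```python
import random
import zlib

# Item 1, the pigeon-hole count: there are 2^N strings of exactly N
# bits, but only 2^N - 1 strings of fewer than N bits (counting the
# empty string), so no injective map can shrink them all.
N = 8
assert 2**N > sum(2**k for k in range(N))

def toy_compressor(bits: str) -> str:
    # Hypothetical example: keep the low N-1 bits.  Always shorter.
    return bits[1:]

# Exhaustively search for the collision the pigeon-hole argument
# guarantees must exist.
seen = {}
collision = None
for m in range(2**N):
    x = format(m, f"0{N}b")
    y = toy_compressor(x)
    assert len(y) < len(x)
    if y in seen:
        collision = (seen[y], x)
        break
    seen[y] = x
print(collision)  # two distinct inputs with the same "compressed" output

# Item 4: a real compressor (zlib's Lempel-Ziv variant) *expands*
# incompressible input, exactly as the counting argument predicts.
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(65536))
assert len(zlib.compress(data)) > len(data)
```

The exhaustive search is feasible only for small N, but the counting inequality holds for every N.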
Harumph. That really ought to be enough. Indeed *one* disproof should have been enough.
> The problem is that the number of iterations is not on the order of
> N, but on the order of 2^N, so it takes log2(around 2^N) = around N
> bits to store the number of iterations.
I don't see why the number of iterations should be exponential in the length (N) of the input string. A compression function is supposed to decrease N. It is not supposed to decrement the number represented by some N-bit numeral .... after all, the string might not represent a number at all.
Also I repeat that there exist special cases (e.g. inputs of known fixed length) for which no extra bits need be represented, as I explained previously.
> The recursion converts a message of N bits recursively into a message
> of 1 or 0 bits (to your taste), *and* a number which takes around N
> bits to store.  Nothing is won.  But prove that.
I disagree, for the reasons given above.
In the worst case, you need log^*(N) extra bits, not N bits. In special cases, you don't need any extra bits at all. The "win" is very substantial. The "win" is extreme.
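For anyone who wants to see just how small log^*(N) is, here is a minimal sketch of the iterated logarithm (my own; base 2, counting applications of log2 until the value drops to 1):

```python
import math

def log_star(n) -> int:
    """Iterated logarithm: how many times log2 must be applied
    before the value drops to 1 or below."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

# log^* grows absurdly slowly: even for a message of 2^65536 bits,
# the worst-case overhead is a handful of bits.
print(log_star(16))        # 3
print(log_star(65536))     # 4
print(log_star(2**65536))  # 5
```

That is the sense in which the "win" is extreme: the overhead is effectively constant for any message you could ever store.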
> This recursion game is far more complicated than it appears to be.
Maybe. But there's no need to make it more complicated than it really is.
> Note also that storing a number takes in reality more than log(N)
> bits.  Why?  Because you don't know N in advance.  We don't have any
> limit on the message length.
For general N, that's true.
> So your counting register theoretically needs infinitely many bits.
Maybe. For many practical purposes, the required number of bits is considerably less than infinity.
> When you're finished you know how many bits your number took.  But
> storing your number needs an end symbol or a tristate bit (0, 1,
> void).  That's a common mistake.
We agree that there are many common mistakes. We agree that it is a mistake to have undelimited strings. But it is also a mistake to think that you need to reserve a special symbol to mark the end of the string. Yes, that is one option, but from a data-compression point of view it is an inefficient option.
Anybody who is interested in this stuff reeeally ought to read Chaitin's work.  He makes a big fuss about the existence of self-delimiting strings and self-delimiting programs.  There are many examples of such:
 -- The codewords of many commonly-used compression algorithms are
    self-delimiting.  This is related to the property of being
    "instantaneously decodable".
 -- As Chaitin points out, you can set up a dialect of Lisp such that
    Lisp programs are self-delimiting.
 -- If you want to be able to represent M, where M is *any* N-bit
    number, you need more than log(M) bits (i.e. more than N bits).
    That's because you need to specify how many bits are used to
    represent M, which adds another log(log(M)) bits.  On top of that,
    you need to specify how many bits that length field takes, which
    adds another log(log(log(M))) bits ..... you get the idea.
    Fortunately the series converges verrry fast.  This series defines
    a new function, named log^*(), pronounced log-star.  Constructive
    examples are known of representation schemes that give a
    self-delimiting representation of *any* integer, no matter how
    large.  These are called 'universal' representations.
 -- etc. etc. etc.
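One concrete universal representation is Elias omega coding, whose recursive length-of-length structure is exactly the log^* idea.  A sketch (my own, handling codewords as strings of '0'/'1' for clarity):

```python
def omega_encode(n: int) -> str:
    """Elias omega code: a self-delimiting codeword for any n >= 1."""
    assert n >= 1
    code = "0"              # terminating bit
    while n > 1:
        b = format(n, "b")  # binary; leading bit is always 1
        code = b + code     # prepend this group ...
        n = len(b) - 1      # ... then recursively encode its length
    return code

def omega_decode(bits: str, i: int = 0):
    """Decode one integer starting at offset i; return (value, next
    offset).  No end marker or out-of-band length is needed: the
    codeword delimits itself."""
    n = 1
    while bits[i] == "1":
        n_new = int(bits[i:i + n + 1], 2)
        i += n + 1
        n = n_new
    return n, i + 1

# Because each codeword is self-delimiting, a concatenated stream
# decodes unambiguously with no separators and no reserved end-symbol.
stream = "".join(omega_encode(n) for n in (1, 2, 16, 1000000))
out, i = [], 0
while i < len(stream):
    v, i = omega_decode(stream, i)
    out.append(v)
print(out)  # [1, 2, 16, 1000000]
```

Note that there is no tristate bit and no special end marker anywhere; the prefix structure of the code does all the delimiting.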
I'm not suggesting that the judge needs to understand the log^* function ... just that it is a mistake to assume that a special end-marker is needed.

Also I'm suggesting that if you give examples of the encoding/decoding process, make sure you indicate whatever metadata (if any!) is required to make the input and output words properly delimited.  (You will notice that in the example I posted of diagramming the pigeon-hole argument, all my words were instantaneously decodable and therefore self-delimiting.  No metadata required.  No reserved end-symbol required.)  You don't need to make a big fuss about it, just make sure the requirements are met, in case anybody decides to check.

In most cases you can duck the issue entirely by saying that you are compressing "objects".  It is often not necessary to specify which part of the object is considered metadata and which part is considered payload.
> When determining the compression rate for a file, people often
> forget that some information is not stored in the file itself, but
> in the file system, e.g. the file length (telling where the
> compressed data stops) and the file name (telling you that the file
> was compressed).  That's basically the same problem.
Yes, it's the same problem, or at least a closely-related problem.
In any case it is a solvable problem.
Note that the *filesystem* itself must be self-delimiting. So it must have a self-delimiting way of representing the name of the file and the length of the file, et cetera. In many cases the filesystem wimps out and places artificial (perhaps huge) limits on the filesize, but one could readily fix things up to use a universal representation.
When comparing compressors, IMHO the only fair thing to do is to consider this metadata in the filesystem to be part of the *object* to be compressed. Theoretically this is more-or-less tantamount to restricting the discussion to inputs and outputs that are self-delimiting.
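One simple way to make that accounting concrete (my own sketch; the LEB128-style varint is just one convenient self-delimiting length field, not something the thread prescribes):

```python
import zlib

def varint(n: int) -> bytes:
    """LEB128-style self-delimiting length field: 7 payload bits per
    byte, high bit set on every byte except the last."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def fair_size(payload: bytes) -> int:
    """Charge the object for the length metadata the filesystem
    would otherwise be smuggling in for free."""
    return len(varint(len(payload))) + len(payload)

# With the metadata charged to both sides, the comparison is fair:
text = b"the quick brown fox jumps over the lazy dog " * 100
raw = fair_size(text)
packed = fair_size(zlib.compress(text))
print(raw, packed)  # packed is far smaller -- a genuine win
```

The point is only that both the original and the compressed object are measured as self-delimiting wholes, so neither side gets free storage from the filesystem.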
---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]
