On Tue, Jul 7, 2020 at 2:31 PM Matt Mahoney <mattmahone...@gmail.com> wrote:

> Why bother with a CIC training and test set? Compression evaluates every
> bit as a test given the previous bits as training. Even if the compression
> algorithm doesn't explicitly predict bits, it is equivalent to one that
> does by the chain rule. The probability of a string is equal to the product
> of the conditional probabilities of its symbols.
>
> You can see this effect at work in http://mattmahoney.net/dc/text.html
> The ranking of enwik8 (first 100 MB) closely tracks the ranking of enwik9.
> Most of the variation is due to memory constraints. In small memory models,
> compression is worse overall and closer to the result you would get from
> compressing the parts independently.
>
> Occam's Razor doesn't necessarily hold under constrained resources. All
> probability distributions over an infinite set of strings must favor
> shorter ones, but that isn't necessarily true over the finite set of
> programs that can run on a computer with finite memory.
>

Yes, and that is the most vocal of Ben's critiques of what we're now
calling (I guess) *The COIN Hypothesis* (however much of a stretch it is
to get to the memetic advantage of that acronym).  To restate that
hypothesis with a little more nuance and refinement, including an
emphasis on resource constraints:

*The COIN (COmpression Information criterioN) Hypothesis* is about the
*empirical world*, and it makes *both* of the following claims:

   1. Among existing models of a process, the one producing the smallest
   executable archive of the training data, *within the same computation
   constraints*, will *generally* also produce the smallest executable
   archive of the test data (a toy sketch of this claim follows the
   list), AND
   2. Among *all* model selection criteria, smallest-archive selection
   will do so *more generally* than any competing criterion.
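
To make claim 1 concrete, here is a toy sketch in Python of the selection
procedure it implies.  This is my illustration, not an established test:
off-the-shelf codecs stand in for "models," the byte counts ignore the
size of the decompressor, the computation constraints aren't enforced,
and the enwik8 file path is hypothetical.

import bz2, lzma, zlib

# Off-the-shelf codecs stand in for candidate "models" of the process.
MODELS = {"zlib": zlib, "bz2": bz2, "lzma": lzma}

def archive_size(model, data):
    # Proxy for the size of an executable archive; a real test would
    # also count the size of the decompressor itself.
    return len(model.compress(data))

def coin_select(models, train):
    # Pick the model producing the smallest archive of the training data.
    return min(models, key=lambda name: archive_size(models[name], train))

text = open("enwik8", "rb").read()  # hypothetical local copy of the corpus
train, test = text[:len(text) // 2], text[len(text) // 2:]

picked = coin_select(MODELS, train)
# Claim 1 predicts the pick also tops the ranking on the held-out test data.
test_ranking = sorted(MODELS, key=lambda n: archive_size(MODELS[n], test))
print(picked, test_ranking)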

More to the point, whenever I discuss COIN as *the* model selection
criterion, and acknowledge the obvious fact that it isn't...

   - a mathematically provable aspect of "the unreasonable effectiveness of
   mathematics in the natural sciences"
   - empirically tested (although it seems measurable)
   - in widespread, or even minority, use

...people react in one of 4 ways, in order of frequency:

   1. Huh?  Wha?  Fuhgeddaboudit.
   2. Where's the empirical evidence?
   3. Minimum Description Length Principle is just the Bayesian Information
   Criterion.
   4. You're just plain wrong because _insert some invalid critique_.

Indeed, the research program I set forth should be pursued, if for no
other reason than to rank-order the general practicality of various
model selection criteria.

On Tue, Jul 7, 2020 at 2:31 PM Matt Mahoney <mattmahone...@gmail.com> wrote:

> Why bother with a CIC training and test set? Compression evaluates every
> bit as a test given the previous bits as training. Even if the compression
> algorithm doesn't explicitly predict bits, it is equivalent to one that
> does by the chain rule. The probability of a string is equal to the product
> of the conditional probabilities of its symbols.
>

Dividing data into training and test sets is standard industry practice.
Why should it not be pursued in this instance, since the point is to
convince people of the truth or falsity of COIN?
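
Also, for what it's worth, Matt's chain-rule point is easy to demonstrate.
Here is a minimal sketch in Python, assuming an order-0 adaptive byte model
of my own choosing (any real compressor's model is far richer, but the
accounting is identical): the code length of a string is the sum of -log2
of the conditional probability of each symbol given everything before it.

import math
from collections import Counter

def code_length_bits(data):
    # Order-0 adaptive model with add-one (Laplace) smoothing: every byte
    # is "tested" using only the bytes before it as "training."
    counts = Counter()
    total_bits = 0.0
    for n, byte in enumerate(data):
        p = (counts[byte] + 1) / (n + 256)  # conditional probability of this byte
        total_bits += -math.log2(p)         # ideal arithmetic-coded length
        counts[byte] += 1
    return total_bits

# By the chain rule, summing -log2 p(symbol | preceding symbols) equals
# -log2 of the product of the conditionals, i.e. -log2 P(whole string).
print(code_length_bits(b"the quick brown fox") / 8, "bytes")

In this framing, a conventional train/test split just fixes the boundary
at one point instead of sliding it past every symbol, which is arguably
why it's still worth doing: it buys familiarity for the audience, not
extra information.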
