Right.  The example is not so much to suggest a practical/significant
improvement in the Hutter Prize as to address a general problem:
Specification of information sometimes requires deliberately leaving out
ordering information -- as in set literals like

cats = {fluffy, scruffy, paws, claws, ...}

This is to make it clear that the data being described has nothing to do
with the order in which the cats' names are listed.
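
A minimal Python sketch of the point (the names here are placeholders, not anything from the corpus): two set literals built in different orders describe the same data, because sets compare by membership rather than by insertion order.

```python
# Two set literals written in different orders denote the same set:
a = {"fluffy", "scruffy", "paws", "claws"}
b = {"claws", "paws", "scruffy", "fluffy"}
print(a == b)  # True -- membership, not listing order, defines the set
```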

Hierarchical data structures will frequently contain elements like this, but
even widely used standards like XML conflate the data with the order of its
serialized syntax, making it hard to avoid "specifying" order that is not
part of the data.
I had something of a knock-down drag-out fight about this with some Perl
Monks 8 years ago regarding serialization of HTML documents
<https://www.perlmonks.org/?node_id=879166>.

If XML standards abstracted out ordering information where needed, it
would, I think, help the data description world quite a bit and may even
have significant positive practical implications for data modeling.
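
The set-of-checksums qualification rule from the proposal quoted below could be sketched like this. The directory layout, file naming, and choice of SHA-256 are my assumptions for illustration, not part of the proposal:

```python
import hashlib
from pathlib import Path


def checksum_set(directory: str) -> set:
    """Return the set of SHA-256 digests, one per article file.

    Because the result is a set, the order in which the files are
    produced or listed carries no information -- only their contents.
    """
    return {
        hashlib.sha256(p.read_bytes()).hexdigest()
        for p in Path(directory).iterdir()
        if p.is_file()
    }


# An entry would qualify if the articles emitted by the self-extracting
# archive match the originals as a set, regardless of extraction order:
# qualifies = checksum_set("original_articles") == checksum_set("extracted")
```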

On Wed, Jan 15, 2020 at 12:30 PM Matt Mahoney <mattmahone...@gmail.com>
wrote:

> Removing the ordering constraint on enwik8  should reduce the compressed
> size by about 50K bytes, or 2 bytes per article. But it wouldn't affect the
> nature of the research. Here is more about the data.
> http://mattmahoney.net/dc/textdata.html
>
> On Tue, Jan 14, 2020, 7:59 AM James Bowery <jabow...@gmail.com> wrote:
>
>> Here's a simple modification to The Hutter Prize
>> <http://prize.hutter1.net/> and the Large Text Compression Benchmark
>> <http://mattmahoney.net/dc/text.html> to illustrate my point:
>>
>> Split the Wikipedia corpus into separate files, one per Wikipedia
>> article.  An entry qualifies only if the set of checksums of the files
>> produced by the self-extracting archive matches that of the original corpus.
>>
>> This reduces the over-constraint imposed by the strictly serialized
>> corpus.
>>
>>
>> On Sun, Jan 5, 2020 at 12:12 PM James Bowery <jabow...@gmail.com> wrote:
>>
>>> In reality, sensors and effectors exist in space as well as time.
>>> Serializing the spatial dimension of observations to formalize their
>>> Kolmogorov Complexity, so they conform to the serialized input to a
>>> Universal Turing machine, over-constrains the observations, introducing
>>> order not relevant to their natural information content, hence artificially
>>> inflating the KC so defined.
>>>
>>> Virtually all models in machine learning are based on tabular data;
>>> even when they can be cast as time series, row-indexed by a timestamp,
>>> each row is an observation with multiple dimensions.  So it seems rather
>>> interesting, if not frustrating, that the default assumption in Algorithmic
>>> Information Theory is a serial UTM.
>>>
>>>

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tc33b8ed7189d2a18-M18a6179432e9e7f2191c719d
Delivery options: https://agi.topicbox.com/groups/agi/subscription
