--- On Sat, 12/27/08, J. Andrew Rogers <[email protected]> wrote:
> An interesting question is which pattern subset if ignored
> would make the problem tractable.
We don't want to make the problem tractable. We want to discover new, efficient
general-purpose learning algorithms. AIXI^tl is intractable, yet we have lots
of fast algorithms for important subsets: linear regression, decision trees,
neural networks, clustering, SVM, etc. If we took out all the problems we
couldn't already solve quickly, then what would be the point?
Here is a sample of the generic compression benchmark data. It consists of
NUL-terminated strings packed 8 bits per byte, with the low bits of the last
byte padded with zero bits. I sorted the data by decreasing frequency of
occurrence in a sample of 1 million strings. The data is binary, but is
displayed here in hex.
The top 20 strings are 5 bits or less in length. The most frequent string is
all zero bits, which has an algorithmic complexity of about
-log2(47161/1000000) = 4.4 bits in the chosen instruction set.
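That complexity estimate is just the negative log of the empirical probability.
A minimal Python sketch (the function name is mine, not part of the benchmark):

```python
import math

def est_complexity_bits(count, total=1_000_000):
    # Estimated algorithmic complexity of a string that occurs `count`
    # times in a sample of `total` strings: -log2 of its empirical
    # probability.
    return -math.log2(count / total)

print(round(est_complexity_bits(47161), 1))  # 4.4 bits for the all-zero string
```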
47161 00
26352 8000
14290 C000
14137 4000
7323 A000
7220 E000
7122 2000
7084 6000
3658 3000
3651 5000
3616 7000
3588 9000
3588 1000
3549 D000
3523 B000
3451 F000
1819 A800
1799 7800
1797 B800
1787 6800
1786 8800
Later we start seeing strings of 1 bits of various lengths, sometimes with a
leading 0 bit, and patterns of alternating 0 and 1 bits (5555...). The string
format constraint does not allow the obvious case of long strings of 0 bits.
393 1200
392 FFFFFFFFF000
392 AE00
391 B600
390 BE00
388 D600
386 8600
385 5E00
384 BA00
384 4E00
379 7A00
377 FA00
375 F600
374 6A00
373 8A00
373 3A00
371 7600
370 D200
370 9600
369 8E00
368 FFFFFFFFFFFFFFFE00
367 9E00
366 1600
364 7E00
363 9A00
351 FFFFFFFFFFE000
344 FFFFFFFFFFFFFFFFFFFFF800
341 FFFFFFFFF800
325 FFFFFFF000
308 FFFFFFFFFFFFC000
289 FFFFFFFFFFFFFFFFFFFFF000
243 7FFFFFFFFFE000
242 55555555555400
241 FFFFFE00
240 FFFFFFFFFFFFF000
236 FFFFFFF800
230 FFFFFF8000
230 E500
229 FFFFFFC000
224 FF00
224 7FFFFFFFF800
224 0D00
222 9900
219 5555555555555500
218 0500
216 8100
216 7FFFFFFFFFFFE000
215 0100
213 4700
211 FFFFFFFE00
211 4100
210 AD00
209 0300
208 8900
207 1500
Here is a sample from the large set of strings that occur exactly twice, which
implies about 19 bits of algorithmic complexity (probability 2/10^6). A typical
sequence has a few leading bits that occur once, followed by a repeating bit
sequence of period 3-5, or occasionally longer. A hex sequence like
249249249... is actually the bit sequence 001001001001...
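The hex-to-bit expansion and the repetition structure are easy to check
mechanically. A short Python sketch (helper names are my own):

```python
def hex_to_bits(h):
    # Expand each hex digit into 4 bits, most significant bit first,
    # so '2', '4', '9' become '0010', '0100', '1001'.
    return ''.join(f'{int(c, 16):04b}' for c in h)

def shortest_period(bits):
    # Smallest p such that the whole string repeats with period p.
    return next(p for p in range(1, len(bits) + 1)
                if all(bits[i] == bits[i % p] for i in range(len(bits))))

print(hex_to_bits('249'))                      # 001001001001
print(shortest_period(hex_to_bits('249249')))  # 3
```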
2 E4E4E4E4E4E4E4E4E4E4E400
2 E4E4E4E4E4E4E4E4E4E4C000
2 E4E4E4E4E4E4E4E4E4E000
2 E4E4E4E4E400
2 E4DFFFFFFE00
2 E4DC00
2 E4DB6DB6DB6DB600
2 E4D400
2 E4D0A000
2 E4CCCCCCCCCCCCCCCCCCCC00
2 E4CCCCCCCCCCCCCC00
2 E4CCCCCCCCCCCC00
2 E4CCCCCCCCCC00
2 E4CCCCCCCC00
2 E4C993264C993264C993264C99326400
2 E4C993264C993264C99300
2 E4C993264C993264C99000
2 E4C993264C993264C98000
2 E4C993264C99324000
2 E4C993264C993000
2 E4C800
2 E4C400
2 E4BC9792F25E4BC97900
2 E4B700
2 E4AE00
2 E4AAAAAAAAAAAAAAAA8000
2 E4AAAAAAAAAAAAAAA000
2 E4AAAAAAA800
2 E4A49249249249249000
2 E4A400
2 E492492492492492492492492492492492492000
2 E4924924924924924924924924924924924900
2 E492492492492492492492492492492400
2 E492492492492492492492492492492000
2 E49249249249249249249249249200
2 E49249249249249249249249248000
2 E48A00
2 E4892248922489224892248800
2 E4888888888800
2 E484422110884422110800
2 E48120481204812000
2 E47FFFFFFFFFFE00
Among the strings that occur only once (which is most of the data), we see many
strings that follow the same types of patterns, but with more unique leading
bits and longer repetition cycles. However, you occasionally come across
strings that have no obvious pattern. THOSE are the interesting problems.
1 FC514514514000
1 FC51255125512551255100
1 FC5100
1 FC50F143C50F143C50F143C400
1 FC50D50D50D50D50D50D50D500
1 FC50AB8A15714200
1 FC508000
1 FC507941E507941E507941E500
1 FC5028140A05028000
1 FC4FEEEEB7776000
1 FC4FDC4FDC4FDC00
1 FC4FB6DB6DB6DB6DB6DB6800
1 FC4F62F727C5EE5F00
1 FC4EC9D93B2764EC9D93B27000
1 FC4E66739CE739CC00
1 FC4DC1B89B83713700
1 FC4DB4924924924800
1 FC4D89B13626C4D89B136000
1 FC4D89B13626C4D800
1 FC4D4C4D4C4D4C4D4C00
1 FC4D1C8000
1 FC4D09A1342684D09A13424000
1 FC4CF8933E24CF8000
1 FC4CC400
1 FC4C7C4C7C4C7C4C7C00
1 FC4C4C4C4C4C4C4C4C4C00
1 FC4C4C4C4C4C4C00
1 FC4C1194A32946528000
1 FC4C00
1 FC4BD24924924924924000
1 FC4B89712E25C4B897128000
1 FC4B575B96AEB72D5D6E4000
1 FC4B48D2348D2348D22000
1 FC4B0800
1 FC4A7E253F129F894FC4A78000
1 FC4A5294A52000
1 FC4A473E25239F1291C000
1 FC4A0941282504A09400
1 FC4A00
The best compressors will compress this data to just under 3 MB, which implies
an average algorithmic complexity of less than 24 bits per string. However, the
language allows the construction of arbitrary 128-bit strings in a fairly
straightforward manner.
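The arithmetic behind that bound, assuming 1 MB = 10^6 bytes:

```python
compressed_bits = 3_000_000 * 8   # "just under 3 MB" of compressed output
num_strings = 1_000_000           # strings in the sample
print(compressed_bits / num_strings)  # 24.0 bits per string on average
```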
-- Matt Mahoney, [email protected]
-------------------------------------------
agi