--- On Sat, 12/27/08, J. Andrew Rogers <[email protected]> wrote:

> An interesting question is which pattern subset if ignored
> would make the problem tractable.

We don't want to make the problem tractable. We want to discover new, efficient 
general-purpose learning algorithms. AIXI^tl is intractable, yet we have lots 
of fast algorithms for important subsets: linear regression, decision trees, 
neural networks, clustering, SVMs, etc. If we took out all the problems we 
can't already solve quickly, what would be the point?

Here is a sample of the generic compression benchmark data. It consists of 
NUL-terminated strings packed 8 bits per byte, with the low bits of the last 
byte padded with zero bits. I sorted the data by decreasing frequency of 
occurrence in a sample of 1 million strings. The data is binary, but is 
displayed here in hex.

The top 20 strings are 5 bits or less in length. The most frequent string is 
all zero bits, which has an algorithmic complexity of about 
-log2(47161/1000000) = 4.4 bits in the chosen instruction set.

47161 00
26352 8000
14290 C000
14137 4000
 7323 A000
 7220 E000
 7122 2000
 7084 6000
 3658 3000
 3651 5000
 3616 7000
 3588 9000
 3588 1000
 3549 D000
 3523 B000
 3451 F000
 1819 A800
 1799 7800
 1797 B800
 1787 6800
 1786 8800
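The complexity estimate above is just the negative log of the empirical 
frequency. A minimal sketch (the count 47161 and the sample size of 1 million 
are taken from the text; everything else is illustrative):

```python
import math

TOTAL = 1_000_000  # sample size stated above

def bits(count, total=TOTAL):
    """Estimated algorithmic complexity: -log2(empirical probability)."""
    return -math.log2(count / total)

# Most frequent string (all zero bits), seen 47161 times:
print(round(bits(47161), 1))  # -> 4.4
```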

Later we start seeing strings of 1 bits of various lengths, sometimes with a 
leading 0 bit, and patterns of alternating 0 and 1 bits (5555...). The string 
format constraint does not allow the obvious case of long strings of 0 bits.

  393 1200
  392 FFFFFFFFF000
  392 AE00
  391 B600
  390 BE00
  388 D600
  386 8600
  385 5E00
  384 BA00
  384 4E00
  379 7A00
  377 FA00
  375 F600
  374 6A00
  373 8A00
  373 3A00
  371 7600
  370 D200
  370 9600
  369 8E00
  368 FFFFFFFFFFFFFFFE00
  367 9E00
  366 1600
  364 7E00
  363 9A00
  351 FFFFFFFFFFE000
  344 FFFFFFFFFFFFFFFFFFFFF800
  341 FFFFFFFFF800
  325 FFFFFFF000
  308 FFFFFFFFFFFFC000
  289 FFFFFFFFFFFFFFFFFFFFF000
  243 7FFFFFFFFFE000
  242 55555555555400
  241 FFFFFE00
  240 FFFFFFFFFFFFF000
  236 FFFFFFF800
  230 FFFFFF8000
  230 E500
  229 FFFFFFC000
  224 FF00
  224 7FFFFFFFF800
  224 0D00
  222 9900
  219 5555555555555500
  218 0500
  216 8100
  216 7FFFFFFFFFFFE000
  215 0100
  213 4700
  211 FFFFFFFE00
  211 4100
  210 AD00
  209 0300
  208 8900
  207 1500

Here is a sample from the large set of strings that occur exactly twice, which 
implies about 19 bits of algorithmic complexity (probability 2/10^6). A typical 
sequence has a few leading bits that occur once, followed by a repeating bit 
sequence of length 3-5 or occasionally longer. A hex sequence like 249249249... 
is actually the bit sequence 001001001001...

     2 E4E4E4E4E4E4E4E4E4E4E400
     2 E4E4E4E4E4E4E4E4E4E4C000
     2 E4E4E4E4E4E4E4E4E4E000
     2 E4E4E4E4E400
     2 E4DFFFFFFE00
     2 E4DC00
     2 E4DB6DB6DB6DB600
     2 E4D400
     2 E4D0A000
     2 E4CCCCCCCCCCCCCCCCCCCC00
     2 E4CCCCCCCCCCCCCC00
     2 E4CCCCCCCCCCCC00
     2 E4CCCCCCCCCC00
     2 E4CCCCCCCC00
     2 E4C993264C993264C993264C99326400
     2 E4C993264C993264C99300
     2 E4C993264C993264C99000
     2 E4C993264C993264C98000
     2 E4C993264C99324000
     2 E4C993264C993000
     2 E4C800
     2 E4C400
     2 E4BC9792F25E4BC97900
     2 E4B700
     2 E4AE00
     2 E4AAAAAAAAAAAAAAAA8000
     2 E4AAAAAAAAAAAAAAA000
     2 E4AAAAAAA800
     2 E4A49249249249249000
     2 E4A400
     2 E492492492492492492492492492492492492000
     2 E4924924924924924924924924924924924900
     2 E492492492492492492492492492492400
     2 E492492492492492492492492492492000
     2 E49249249249249249249249249200
     2 E49249249249249249249249248000
     2 E48A00
     2 E4892248922489224892248800
     2 E4888888888800
     2 E484422110884422110800
     2 E48120481204812000
     2 E47FFFFFFFFFFE00
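The hex-to-bit correspondence and the 19-bit estimate above can be checked with 
a short sketch (hex digits and counts taken from the text; the helper name is 
mine):

```python
import math

def hex_to_bits(h):
    """Expand a hex string into its bit sequence, 4 bits per digit."""
    return ''.join(format(int(d, 16), '04b') for d in h)

print(hex_to_bits('249'))  # -> 001001001001 (the repeating 3-bit cycle)
print(hex_to_bits('E4'))   # -> 11100100 (the common leading byte above)

# A string seen twice in 10^6 samples:
print(round(-math.log2(2 / 1_000_000)))  # -> 19 bits
```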

Among strings that occur once (which is most of the data), we see many strings 
that follow the same types of patterns, but with more unique leading bits and 
longer repetition cycles. However, you occasionally come across strings that 
have no obvious pattern. THOSE are the interesting problems.

    1 FC514514514000
    1 FC51255125512551255100
    1 FC5100
    1 FC50F143C50F143C50F143C400
    1 FC50D50D50D50D50D50D50D500
    1 FC50AB8A15714200
    1 FC508000
    1 FC507941E507941E507941E500
    1 FC5028140A05028000
    1 FC4FEEEEB7776000
    1 FC4FDC4FDC4FDC00
    1 FC4FB6DB6DB6DB6DB6DB6800
    1 FC4F62F727C5EE5F00
    1 FC4EC9D93B2764EC9D93B27000
    1 FC4E66739CE739CC00
    1 FC4DC1B89B83713700
    1 FC4DB4924924924800
    1 FC4D89B13626C4D89B136000
    1 FC4D89B13626C4D800
    1 FC4D4C4D4C4D4C4D4C00
    1 FC4D1C8000
    1 FC4D09A1342684D09A13424000
    1 FC4CF8933E24CF8000
    1 FC4CC400
    1 FC4C7C4C7C4C7C4C7C00
    1 FC4C4C4C4C4C4C4C4C4C00
    1 FC4C4C4C4C4C4C00
    1 FC4C1194A32946528000
    1 FC4C00
    1 FC4BD24924924924924000
    1 FC4B89712E25C4B897128000
    1 FC4B575B96AEB72D5D6E4000
    1 FC4B48D2348D2348D22000
    1 FC4B0800
    1 FC4A7E253F129F894FC4A78000
    1 FC4A5294A52000
    1 FC4A473E25239F1291C000
    1 FC4A0941282504A09400
    1 FC4A00

The best compressors will compress this data to just under 3 MB, which implies 
an average algorithmic complexity of less than 24 bits per string. However, the 
language allows the construction of arbitrary 128-bit strings in a fairly 
straightforward manner.
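The per-string average follows directly from the compressed size; a quick check 
of the arithmetic (assuming 3 MB means 3,000,000 bytes, as the 24-bit figure 
implies):

```python
compressed_bytes = 3_000_000   # "just under 3 MB" (assumed 10^6-byte MB)
strings = 1_000_000            # sample size from the text

print(compressed_bytes * 8 / strings)  # -> 24.0 bits per string
```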

-- Matt Mahoney, [email protected]




