Unsupervised feature discovery applied to text presented as its raw bit string 
representation would need to discover that octets are the first-order parse of 
such a bit string. This raises a question:


What is the technique called that can discover that a binary string, for 
example:


0100100111110010101010101011111011001000111000100101110001111110111010111110010111001010100011111110001101100101001010101111000111101011010011111001111101001111101011111011110011011001111000010100110001


has the simple model (with A x B meaning A occurs B times in the bag):


{00 x 1, 01 x 2, 10 x 3, 11 x 4}


even though it knows only that it should group bits into substrings (tokens) of 
equal bit length (i.e., it does not know in advance that it should group bits 
in pairs)?


That is to say, if the binary input string was generated by a Perl program:


for(0..100){print ( (('00') x 1, ('01') x 2, ('10') x 3, ('11') x 4)[rand(10)])}


the technique would reject, as less predictive, the distribution (model):


{0 x 7, 1 x 13}


and it would also reject a model that used 2-bit tokens aligned on odd bit 
boundaries, as well as models that used 3-bit or longer tokens.
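One candidate framing (my assumption; the post itself doesn't name it) is 
minimum-description-length (MDL) model selection over token widths and offsets: 
score each fixed-width tokenization with a two-part code (cost of the token 
frequency table plus the entropy cost of the token sequence under it) and keep 
the cheapest. All function names and the exact model-cost term below are 
illustrative, not a definitive implementation:

```python
# Hedged sketch of MDL-style selection of a fixed token width for a bit string.
import math
from collections import Counter

def description_length(bits, width, offset=0):
    """Two-part code length: token frequency table plus the entropy
    cost of the token sequence under that table."""
    tokens = [bits[i:i + width]
              for i in range(offset, len(bits) - width + 1, width)]
    counts = Counter(tokens)
    n = len(tokens)
    # Data cost: optimal code length of the token sequence under the
    # empirical distribution.
    data_cost = -sum(c * math.log2(c / n) for c in counts.values())
    # Model cost: roughly (log2 n)/2 bits per free parameter -- one per
    # distinct token.  (A standard, but not the only, MDL convention.)
    model_cost = 0.5 * len(counts) * math.log2(n) if n > 1 else 0.0
    # Bits left over at the edges (offset prefix, ragged suffix) are
    # charged one bit each, i.e. coded literally.
    leftover = offset + (len(bits) - offset) % width
    return data_cost + model_cost + leftover

def best_tokenization(bits, max_width=8):
    """Return (score, width, offset) minimizing the description length."""
    return min((description_length(bits, w, o), w, o)
               for w in range(1, max_width + 1)
               for o in range(w))
```

Under this scoring, a misaligned or wrong-width tokenization pays either a 
higher per-token entropy or a larger frequency table, so it loses to the 
generating width when the pairwise structure is strong enough to overcome the 
extra parameters.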


A related, more difficult technique would find the model for a string 
generated by sampling the bag:


{0 x 1, 1 x 1, 00 x 1, 01 x 2, 10 x 3, 11 x 4}


That is to say, the bit string is a mix of token sizes.



------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T07349206c4d4db02-M96c74be7cfbdd01096c4f505