Unsupervised feature discovery over text presented as a raw bit string would
first need to discover that octets are the first-order parse of that string.
This raises a question:
What is the technique called that can discover that a binary string, for
example:
0100100111110010101010101011111011001000111000100101110001111110111010111110010111001010100011111110001101100101001010101111000111101011010011111001111101001111101011111011110011011001111000010100110001
has the simple model (with A x B meaning A occurs B times in the bag):
{00 x 1, 01 x 2, 10 x 3, 11 x 4}
even though it knows only that it should group bits into substrings (tokens)
of some uniform bit length (i.e. it doesn't know the tokens are pairs)?
That is to say, if the binary string input was generated by a Perl program:
for (0..100) { print((('00') x 1, ('01') x 2, ('10') x 3, ('11') x 4)[rand 10]) }
the technique would reject, as less predictive, the single-bit distribution
(model):
{0 x 7, 1 x 13}
and it would also reject a model that used 2-bit tokens on odd-numbered bit
boundaries, as well as models that used 3-bit or longer tokens.
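One natural framing for this fixed-width case is two-part minimum description
length (MDL) model selection: score each candidate (token width, phase offset)
by the bits needed to encode the string under the empirical token distribution
plus a cost for the model's parameters, and keep the cheapest. Below is a
minimal sketch of that idea, assuming a BIC-style parameter cost; the
generator mirrors the Perl program above, and all names are my own:

```python
import math
import random

def description_length(tokens):
    """Two-part code length in bits: data cost under the empirical token
    distribution, plus a BIC-style cost of (k-1)/2 * log2(n) for the k-1
    free parameters of a k-token bag."""
    n = len(tokens)
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return n * entropy + (len(counts) - 1) / 2 * math.log2(n)

def best_tokenization(bits, max_width=4):
    """Score every (token width, phase offset) pair; return the cheapest
    as (description length, width, offset)."""
    best = None
    for width in range(1, max_width + 1):
        for offset in range(width):
            usable = (len(bits) - offset) // width * width
            tokens = [bits[i:i + width]
                      for i in range(offset, offset + usable, width)]
            candidate = (description_length(tokens), width, offset)
            if best is None or candidate < best:
                best = candidate
    return best

# Mirror the Perl generator, with enough data to separate the models.
random.seed(0)
bag = ['00'] * 1 + ['01'] * 2 + ['10'] * 3 + ['11'] * 4
bits = ''.join(random.choice(bag) for _ in range(500000))
dl, width, offset = best_tokenization(bits)
print(width, offset)  # should select width 2, offset 0
```

Roughly: the 1-bit model loses because it spends about 0.01 more bits per bit
than the pair model; 2-bit tokens on odd boundaries lose because straddling
tokens factor into near-independent halves; and 4-bit tokens on even
boundaries match the 2-bit data cost exactly but pay for 15 parameters
instead of 3, so the parameter cost rejects them. Wider tokens only make that
cost worse.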
A related, more difficult technique would find the model for a string
generated by sampling the bag:
{0 x 1, 1 x 1, 00 x 1, 01 x 2, 10 x 3, 11 x 4}
That is to say, the bit string is a mix of token sizes.
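The thread doesn't name a technique for this harder mixed-size case; one
well-known heuristic in the same spirit (my suggestion, not something stated
above) is byte-pair encoding: start from single-bit symbols and repeatedly
merge the most frequent adjacent pair, so longer tokens emerge only where
they recur often enough to pay for themselves. An MDL score like the one
above can then decide how many merges are worth keeping. A sketch:

```python
import random
from collections import Counter

def bpe(symbols, n_merges):
    """Greedy byte-pair encoding: repeatedly replace the most frequent
    adjacent symbol pair with a single merged token."""
    merges = []
    for _ in range(n_merges):
        pair_counts = Counter(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]
        merges.append((a, b))
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)  # fuse the pair into one longer token
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, merges

# Same pair-bag generator as before (my naming); start from 1-bit symbols.
random.seed(1)
bag = ['00'] * 1 + ['01'] * 2 + ['10'] * 3 + ['11'] * 4
bits = list(''.join(random.choice(bag) for _ in range(5000)))
symbols, merges = bpe(bits, 4)
print(merges[0])  # '1' next to '1' is the most frequent adjacent pair here
```

Merging never changes the underlying bits, only how they are grouped, so the
final symbol sequence always concatenates back to the original string.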
------------------------------------------
Artificial General Intelligence List: AGI
Permalink:
https://agi.topicbox.com/groups/agi/T07349206c4d4db02-M96c74be7cfbdd01096c4f505