Hi Carsten,
I was pointing you to the Cortical.io API in case you opted for a word- or 
text-level model. For a character-level model you would simply use random, 
one-hot representations such that semantic similarities aren't encoded into the 
SDRs. You can use the CategoryEncoder [1] for this.
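To illustrate the idea without pulling in NuPIC, here is a minimal sketch of 
what the CategoryEncoder does for this use case: each character gets a fixed 
random SDR, so distinct characters overlap very little and no semantic 
similarity is encoded. The parameters n, w and the helper name are my own 
choices, not anything from the library:

```python
import random

def make_char_sdrs(chars, n=1024, w=21, seed=42):
    """Assign each character a fixed random SDR: a set of w active bits
    out of n. Random SDRs barely overlap, so no semantic similarity is
    encoded -- the same effect the CategoryEncoder gives you."""
    rng = random.Random(seed)
    return {c: frozenset(rng.sample(range(n), w)) for c in chars}

sdrs = make_char_sdrs("abcdefghijklmnopqrstuvwxyz ")
# Every character maps to a distinct, fixed set of 21 active bits.
```

With n=1024 and w=21 the expected overlap between two random SDRs is well 
under one bit, which is all "no encoded similarity" means here.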

The feasibility of the TM learning character-level sequences depends on how you 
define the learned sequences, and consequently how many there are. That is, is 
a single sequence a word, a sentence, a paragraph? Do the sequences repeat 
during training? If not, the TM won't learn them.

The TM should have sufficient capacity provided you have enough cells per 
column [2] and segments per cell [3]. The TM is a memory of sequences, but 
fundamentally it learns transitions between inputs, so capacity is measured by 
how many transitions a TM region can store. For example, a TM region with 2% 
column activation (i.e. sparsity), 32 cells per column, and 128 segments per 
cell can store approximately (32/0.02)*128 = 204,800 transitions.
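The back-of-the-envelope calculation above, spelled out (the parameter names 
are just labels for the numbers in the example, not NuPIC parameter names):

```python
# Rough TM capacity estimate: cells per column divided by column
# sparsity gives the pool of distinct cells available per input,
# and each cell can anchor segments_per_cell learned transitions.
sparsity = 0.02          # 2% column activation
cells_per_column = 32
segments_per_cell = 128

transitions = round(cells_per_column / sparsity) * segments_per_cell
print(transitions)  # 204800
```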

[1] https://github.com/numenta/nupic/blob/master/src/nupic/encoders/category.py
[2] 
https://github.com/numenta/nupic/blob/master/examples/opf/clients/hotgym/simple/model_params.py#L151
[3] 
https://github.com/numenta/nupic/blob/master/examples/opf/clients/hotgym/simple/model_params.py#L183

Cheers,
Alex
