Hi Carsten,

I was pointing you to the Cortical.io API in case you opted for a word- or text-level model. For a character-level model you would simply use random one-hot representations, so that no semantic similarities are encoded into the SDRs. You can use the CategoryEncoder [1] for this.
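To make the idea concrete, here is a minimal standalone sketch of that kind of encoding (plain numpy, not the actual CategoryEncoder API): each character gets its own dedicated block of active bits, so no two SDRs share bits and no similarity between characters is implied.

```python
import string

import numpy as np


def build_char_encoder(chars, w=21):
    """Assign each character its own contiguous block of w active bits.

    Because the blocks never overlap, the dot product of any two
    distinct characters' SDRs is zero, i.e. no semantic similarity
    is encoded. (w=21 is an illustrative choice, not a NuPIC default.)
    """
    n = w * len(chars)  # total bits in the output SDR
    table = {}
    for i, ch in enumerate(chars):
        sdr = np.zeros(n, dtype=np.uint8)
        sdr[i * w:(i + 1) * w] = 1
        table[ch] = sdr
    return table


encoder = build_char_encoder(string.ascii_lowercase)
a, b = encoder["a"], encoder["b"]
assert a.sum() == 21 and b.sum() == 21
assert np.dot(a, b) == 0  # no overlap between distinct characters
```

The real CategoryEncoder works the same way in spirit: every category gets a unique, non-overlapping set of active bits.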
Whether the TM can learn character-level sequences depends on how you define a learned sequence, and consequently on how many there are. That is, is a single sequence a word, a sentence, a paragraph? Do the sequences repeat during training? If not, the TM won't learn them. The TM should have sufficient capacity given enough cells per column [2] and segments per cell [3]. The TM is a memory of sequences, but fundamentally it learns transitions between inputs, so its capacity is measured by how many transitions a TM region can store. For example, a TM region with 2% column activation (i.e. sparsity), 32 cells per column, and 128 segments per cell can store approximately (32/0.02)*128 = 204,800 transitions.

[1] https://github.com/numenta/nupic/blob/master/src/nupic/encoders/category.py
[2] https://github.com/numenta/nupic/blob/master/examples/opf/clients/hotgym/simple/model_params.py#L151
[3] https://github.com/numenta/nupic/blob/master/examples/opf/clients/hotgym/simple/model_params.py#L183

Cheers,
Alex
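P.S. The capacity estimate above is just back-of-the-envelope arithmetic, so it's easy to check for other parameter choices:

```python
def tm_transition_capacity(cells_per_column, column_sparsity, segments_per_cell):
    """Rough TM capacity estimate: (cells per column / column sparsity)
    gives the number of cells available per active input, and each
    segment can store one learned transition onto a cell."""
    return (cells_per_column / column_sparsity) * segments_per_cell


# The example from the message: 2% sparsity, 32 cells/column, 128 segments/cell.
print(tm_transition_capacity(32, 0.02, 128))  # -> 204800.0
```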
