Hi Carsten,

I'm glad you're looking to use NuPIC for NLP. Here's a motivating example from our fall 2013 hackathon: [1].

A couple of reasons I would recommend against a character-level model:

1. http://www.brainhq.com/brain-resources/brain-teasers/scrambled-text
2. Character-level sequences in TM would essentially be memorized, so the model wouldn't be able to generalize to new data. A model constrained to, e.g., a single book chapter may work well on that chapter, but it would not do well on any other chapter of the book. That is, there are far too many character sequences to learn in human language.

The theoretical points you raise on human language are accurate. State-of-the-art deep learning models use methods such as sliding windows over text inputs, and bidirectional and/or stacked RNNs that process text both forwards and backwards.

I recommend playing around with the Cortical.io API [2], for which they offer a Python client [3] for querying things like word and text encodings.

[1] https://www.youtube.com/watch?v=X4XjYXFRIAQ&start=7084
[2] http://api.cortical.io/
[3] https://github.com/cortical-io/python-client-sdk

Cheers,
Alex
