Hi Alex,

Thank you very much for your input! I am aware of character-based RNNs; they were actually the inspiration for trying the same thing with HTM/NuPIC. You are of course right that there are too many sequences to learn exhaustively, but I was envisioning deriving a likelihood from the sequences learned from a large training corpus, and then flagging an anomaly whenever a threshold is exceeded. Are you suggesting that this approach is infeasible with NuPIC? If so, is that due to the amount of training data required, or for theoretical reasons?
I'm not quite sure how the Cortical.io API could help with this task, though. I thought it operated at the word level, so I probably cannot draw any conclusions at the character level, can I?

Carsten

On 20 Oct 2015 at 7:06 PM, Alex Lavin <[email protected]> wrote:

Hi Carsten,

I'm glad you're looking to use NuPIC for NLP. Here's a motivating example from our fall 2013 hackathon: [1].

A couple of reasons I would recommend not doing a character-level model:

1. http://www.brainhq.com/brain-resources/brain-teasers/scrambled-text

2. Character-level sequences in TM would essentially memorize the sequences, such that you wouldn't be able to generalize to new data. So constraining your model to, e.g., a book chapter may work well, but it would not do well on any other chapter of the book. That is, there are far too many character sequences in human language to learn.

The theoretical points you raise about human language are accurate. State-of-the-art deep learning models use methods such as sliding windows over text inputs, and bi-directional and/or stacked RNNs that process text both forwards and backwards.

I recommend playing around with the Cortical.io API [2], for which they offer a Python client [3] for querying things like word and text encodings.

[1] https://www.youtube.com/watch?v=X4XjYXFRIAQ&start=7084
[2] http://api.cortical.io/
[3] https://github.com/cortical-io/python-client-sdk

Cheers,
Alex
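The likelihood-plus-threshold idea Carsten describes can be illustrated without HTM at all. The sketch below is not NuPIC code; it is a deliberately simple character n-gram stand-in for "learned sequence likelihood", and the function names (`train_char_ngrams`, `anomaly_score`) and the threshold value are hypothetical choices for illustration only. It shows the mechanism under discussion: train on a corpus, score new text by how many of its character transitions were never seen in training, and flag an anomaly when the score exceeds a threshold.

```python
from collections import defaultdict

def train_char_ngrams(corpus, n=3):
    """Count character n-gram continuations: (n-1)-char context -> next-char counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(corpus) - n + 1):
        ctx, nxt = corpus[i:i + n - 1], corpus[i + n - 1]
        counts[ctx][nxt] += 1
    return counts

def anomaly_score(text, counts, n=3):
    """Fraction of n-gram transitions in `text` never seen in training (0.0 = all familiar)."""
    transitions = len(text) - n + 1
    if transitions <= 0:
        return 0.0
    unseen = sum(
        1 for i in range(transitions)
        if counts[text[i:i + n - 1]][text[i + n - 1]] == 0
    )
    return unseen / transitions

corpus = "the quick brown fox jumps over the lazy dog " * 20
model = train_char_ngrams(corpus)

THRESHOLD = 0.2  # hypothetical cutoff; would be tuned on held-out data
for sample in ("the quick brown fox", "zzxqv qqq zzz"):
    score = anomaly_score(sample, model)
    print(sample, score, "ANOMALY" if score > THRESHOLD else "ok")
```

Alex's objection maps onto this sketch directly: with character-level contexts, the model only recognizes transitions it has literally memorized, so any text outside the training corpus tends to score as anomalous; that is the generalization problem, not a data-volume problem.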
