Hi Alex,
Thank you very much for your input! I am aware of character-based RNNs; in 
fact, they were the inspiration for trying the same thing with HTM/NuPIC. You 
are right, of course, that there are too many sequences to learn, but my idea 
was to derive a likelihood from the sequences learned from a large training 
corpus, and then flag an anomaly whenever a threshold is exceeded.
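For what it's worth, the "likelihood plus threshold" idea can be sketched 
independently of the learning model. Here is a crude stand-in using character 
trigram counts; all names and the threshold value are illustrative assumptions, 
not NuPIC code:

```python
# Crude stand-in for the "likelihood + threshold" idea: learn character
# trigram frequencies from a corpus, score new text by the geometric mean
# of its trigram probabilities, and flag an anomaly when the score drops
# below a threshold. Everything here is illustrative, not NuPIC API.
import math
from collections import Counter

def train_trigrams(corpus):
    """Count character trigrams in the training corpus."""
    counts = Counter(corpus[i:i + 3] for i in range(len(corpus) - 2))
    return counts, sum(counts.values())

def likelihood(text, counts, total):
    """Geometric mean of trigram probabilities (0.0 if any trigram is unseen)."""
    grams = [text[i:i + 3] for i in range(len(text) - 2)]
    if not grams:
        return 0.0
    probs = [counts[g] / total for g in grams]
    if 0.0 in probs:
        return 0.0
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

def is_anomaly(text, counts, total, threshold=1e-4):
    return likelihood(text, counts, total) < threshold

corpus = "the quick brown fox jumps over the lazy dog " * 50
counts, total = train_trigrams(corpus)
print(is_anomaly("the quick brown", counts, total))  # False: familiar text
print(is_anomaly("xq zvke qj", counts, total))       # True: unfamiliar text
```

The geometric mean is used so that a single never-seen trigram drags the score 
to zero, which is the behavior a sequence-memory model would approximate.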
Are you suggesting that this approach is infeasible with NuPIC? Because of the 
amount of training data required, or for theoretical reasons?

I'm not quite sure how the Cortical.io API could help with that task, though. I 
thought it operated at the word level, so I probably cannot draw any 
conclusions at the character level, can I?
Carsten

On 20 Oct 2015, at 7:06 PM, Alex Lavin <[email protected]> wrote:
Hi Carsten,
I'm glad you're looking to use NuPIC for NLP. Here's a motivating example from 
our fall 2013 hackathon: [1].

A couple of reasons I would recommend against a character-level model:
  1. http://www.brainhq.com/brain-resources/brain-teasers/scrambled-text
  2. A TM at the character level would essentially memorize the sequences, so 
it would not generalize to new data. Constraining your model to, e.g., a single 
book chapter may work well, but it would not do well on any other chapter of 
the book. That is, there are far too many character sequences in human language 
to learn them all.

The theoretical points you raise about human language are accurate. 
State-of-the-art deep learning models use methods such as sliding windows over 
text inputs, and bidirectional and/or stacked RNNs that process text both 
forwards and backwards.
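As an aside, the sliding-window preprocessing mentioned above amounts to 
something like the following (window size and stride are arbitrary choices 
here, not values from any particular model):

```python
def sliding_windows(text, size=10, stride=1):
    """Yield fixed-size overlapping windows over the input text."""
    for i in range(0, max(len(text) - size + 1, 1), stride):
        yield text[i:i + size]

print(list(sliding_windows("hello world", size=5, stride=3)))
# ['hello', 'lo wo', 'world']
```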

I recommend playing around with the Cortical.io API [2]; they offer a Python 
client [3] for querying things like word and text encodings.
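Word-level fingerprints from the API are sparse bit patterns, so semantic 
similarity between two words can be estimated from the overlap of their active 
bits. A minimal local sketch with made-up fingerprints standing in for real API 
responses (the commented client call is hypothetical and unverified):

```python
# Semantic similarity from word fingerprints (sparse sets of active-bit
# indices). The fingerprints below are tiny made-up examples; a real call
# to the Cortical.io Python client would look something like (hypothetical):
#   fingerprint = client.getFingerprint("apple")

def overlap(fp_a, fp_b):
    """Fraction of shared active bits between two fingerprints (index lists)."""
    a, b = set(fp_a), set(fp_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

fp_apple  = [2, 5, 9, 13, 21]   # made-up active-bit indices
fp_orange = [2, 5, 13, 30, 42]
fp_car    = [7, 18, 33, 40, 55]
print(overlap(fp_apple, fp_orange))  # 0.6 -> semantically related
print(overlap(fp_apple, fp_car))     # 0.0 -> unrelated
```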

[1] https://www.youtube.com/watch?v=X4XjYXFRIAQ&start=7084
[2] http://api.cortical.io/
[3] https://github.com/cortical-io/python-client-sdk

Cheers,
Alex
