Have you tried tweaking 'n' (I keep forgetting exactly what n refers to here...) and 'w' (if I remember correctly, the number of active bits in the SDR) for the longer sentences?
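For context on those parameters: in NuPIC's encoders, 'n' is the total number of bits in the output SDR and 'w' is the number of those bits that are active. Here is a minimal, self-contained sketch of the idea (this is not the real nupic CategoryEncoder API; the helper name and block layout are made up for illustration):

```python
# Sketch of an n/w category encoding, illustrating the NuPIC convention:
# n = total bits in the SDR, w = number of active bits.
# Categories get non-overlapping blocks of w bits, so n must be at least
# (number of categories) * w. Hypothetical helper, not nupic's encoder.

def encode_category(index, n, w):
    """Map category `index` to an SDR with w active bits out of n total."""
    start = index * w
    if start + w > n:
        raise ValueError("n too small for this category at width w")
    sdr = [0] * n
    for i in range(start, start + w):
        sdr[i] = 1
    return sdr

# Example: encode 'c' (category index 2) with n = 26 * 3 = 78, w = 3,
# which activates the block of bits 6..8.
alphabet = "abcdefghijklmnopqrstuvwxyz"
sdr = encode_category(alphabet.index("c"), n=78, w=3)
```

Bumping w widens each category's footprint (more robust matching, at the cost of needing a larger n), which is why it is one of the first knobs worth turning when predictions degrade.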
On Tue, Aug 27, 2013 at 9:22 PM, Chetan Surpur <[email protected]> wrote:

> With all this talk about Natural Language Processing and the hackathon, I
> figured this is a good time to share a little project my friend and I have
> recently started using NuPIC. We call it Linguist, and our goal for it is
> an AI that can read text unsupervised (from Wikipedia and the rest of the
> internet), build a model of language, and provide better autocorrect /
> autopredict for mobile keyboards (as the first useful application for it).
>
> So far, we've tried feeding characters, one at a time, to the CLA, each
> one encoded as a category. We've watched it learn sentences the way the
> melody-learning AI from the last hackathon learned notes.
>
> You can download what we have so far, and try it yourself:
> https://github.com/chetan51/linguist
> It might also be a decent platform to start experimenting with NLP tasks
> for the upcoming hackathon!
>
> We have a couple of interesting ideas, and a bunch of questions.
>
> Some ideas:
>
> We want to train it on public text from the internet to build a global
> model, then install it on a user's phone and have it learn (possibly with
> higher weight) from the user's own text messages and emails. This would
> take an already intelligent model and personalize it to the user's own
> style of writing and vocabulary.
>
> We're also thinking of using anomaly detection to fix spelling mistakes,
> and probability thresholds to suggest the rest of the word, phrase, and
> sentence without being annoying. We're hoping that the CLA will live up to
> being a good algorithm for this application, and we're very curious to see
> how well it will do.
>
> Some questions:
>
> While playing with it, we noticed that it learns sequences pretty quickly,
> but patterns very slowly. We repeated a short sentence many times, and it
> was able to predict fairly correctly the rest of the sentence at every
> position in the sentence after a couple of repetitions.
> But when we fed it long text, such as novels from Project Gutenberg, its
> predictions were almost totally incoherent.
>
> Could this be because the CLA is currently implemented as just a single
> region, without hierarchies? For that matter, how well can a single region
> do at predicting complex patterns like those in language, beyond just
> simple character transitions? Do we need hierarchy support before we'll
> see any decent performance on this task?
>
> We're also not totally clear on why a perfect run during the short
> sentence-repetition exercise described above is sometimes followed by a
> mistake in the next run. Why exactly, down to the level of individual
> neuronal connections, can the prediction accuracy go down with an
> additional repetition of a pattern? Is it because the algorithm is
> stochastic? We'd love any insight on that :)
>
> Finally, we'd like to invite interested parties to join us in exploring
> these (and related) NLP applications of NuPIC. I would love to learn
> faster by working with other interested people and bouncing ideas off
> each other. Let me know if you'd like to chat!
>
> Thank you for your time, and for your answers to my questions,
> Chetan
>
> _______________________________________________
> nupic mailing list
> [email protected]
> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
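The anomaly-detection-for-spelling idea above can be illustrated without NuPIC at all. In the toy sketch below, a character bigram model stands in for the CLA's per-character predictions, and a fixed probability threshold plays the role of an anomaly-score cutoff; the function names and the threshold value are made up for illustration:

```python
# Toy version of "use anomaly detection to fix spelling mistakes":
# flag characters whose probability under a learned model falls below a
# threshold. A character bigram count table stands in for CLA predictions.
from collections import defaultdict


def train_bigrams(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(text, text[1:]):
        counts[prev][cur] += 1
    return counts


def flag_anomalies(text, counts, threshold=0.05):
    """Return indices of characters that were poorly predicted."""
    flagged = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        total = sum(counts[prev].values())
        prob = counts[prev][cur] / total if total else 0.0
        if prob < threshold:
            flagged.append(i)
    return flagged


corpus = "the cat sat on the mat " * 50
model = train_bigrams(corpus)
# 'q' in "cqt" was never seen after 'c', so positions 5 and 6 get flagged.
flags = flag_anomalies("the cqt sat", model)
```

A real implementation on the CLA would use its prediction confidence per character instead of bigram frequencies, but the thresholding logic would look much the same.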
