With all this talk about Natural Language Processing and the hackathon, I
figured this is a good time to share a little project my friend and I have
recently started using NuPIC. We call it Linguist, and our goal for it is
an AI that can read text unsupervised (from Wikipedia and the rest of the
internet), build a model of language, and provide better autocorrect /
autopredict for mobile keyboards (as the first useful application for it).

So far, we've tried feeding characters, one at a time, to the CLA, each one
encoded as a category. We've watched it learn sentences the way the
melody-learning AI from the last hackathon learned notes.

You can download what we have so far, and try it yourself:
https://github.com/chetan51/linguist
It might also be a decent platform to start experimenting with NLP tasks
for the upcoming hackathon!

We have a couple of interesting ideas, and a bunch of questions.

Some ideas:

We want to train it on public text from the internet to build a global
model, then install it on a user's phone and have it keep learning
(possibly with higher weight) from the user's own text messages and
emails. This would take an already intelligent model and personalize it
to the user's own style of writing and vocabulary.
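To make the global-then-personal idea concrete, here's a toy sketch with a bigram counter standing in for the CLA. The PERSONAL_WEIGHT knob and the ToyModel interface are made up for illustration; they are not NuPIC parameters:

```python
from collections import defaultdict

# Toy stand-in for the CLA: a bigram counter trained first on a global
# corpus, then on the user's own text with a higher weight, so personal
# habits can override the global statistics.

PERSONAL_WEIGHT = 5  # personal text counts this many times more (made-up knob)

class ToyModel:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text, weight=1):
        """Count each character transition, scaled by weight."""
        for prev, nxt in zip(text, text[1:]):
            self.counts[prev][nxt] += weight

    def predict(self, prev):
        """Return the most likely next character, or None if unseen."""
        nexts = self.counts.get(prev)
        if not nexts:
            return None
        return max(nexts, key=nexts.get)

model = ToyModel()
model.train("the quick brown fox")               # global corpus
model.train("gr8 thx", weight=PERSONAL_WEIGHT)   # user's own messages
```

After this, the model predicts "8" after "r" (the user's habit outweighs "brown" from the global corpus), which is the personalization effect we're after.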

We're also thinking of using anomaly detection to fix spelling mistakes,
and probability thresholds to suggest the rest of the word, phrase, and
sentence without being annoying. We're hoping the CLA will prove to be a
good algorithm for this application, and we're very curious to see how
well it will do.
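As a rough sketch of those two mechanisms (a set of known words stands in for the CLA's learned sequences, and KNOWN_WORDS, the scores, and SUGGEST_THRESHOLD are all invented for illustration):

```python
# Two toy mechanisms: (1) treat never-seen words as anomalies (likely
# typos); (2) only offer a completion when its predicted probability
# clears a threshold, so suggestions aren't annoying.
# KNOWN_WORDS and SUGGEST_THRESHOLD are illustrative stand-ins for the
# CLA's learned sequences and its prediction confidences.

KNOWN_WORDS = {"the", "cat", "sat", "hello"}
SUGGEST_THRESHOLD = 0.5

def anomaly_score(word):
    """1.0 for a never-seen word (likely a typo), 0.0 for a known one."""
    return 0.0 if word in KNOWN_WORDS else 1.0

def suggest(completions):
    """Return the most likely completion only if it clears the threshold."""
    best = max(completions, key=completions.get)
    return best if completions[best] >= SUGGEST_THRESHOLD else None
```

So `suggest({"hello": 0.8, "helmet": 0.2})` offers "hello", but `suggest({"cat": 0.4, "can": 0.3})` stays quiet rather than guessing.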

Some questions:

While playing with it, we noticed that it learns sequences pretty quickly,
but patterns very slowly. We repeated a short sentence many times, and
after a couple of repetitions it was able to predict the rest of the
sentence fairly accurately from any position. But when we fed it
long text, such as novels from Project Gutenberg, its predictions were
almost totally incoherent.

Could this be because the CLA is currently implemented as just a single
region, without hierarchies? For that matter, how well can a single region
do for predicting complex patterns like those in language, beyond just
simple character transitions? Do we need hierarchy support before we'll see
any decent performance on this task?

We're also not totally clear on why a perfect run of the short
sentence-repetition exercise described above is sometimes followed by a
mistake in the next run. Why exactly, down to the level of neuronal
connections, can prediction accuracy go down with an additional
repetition of a pattern? Is it because the algorithm is stochastic? We'd
love any insight on that :)

Finally, we'd like to invite interested parties to join us in exploring
this and related NLP applications of NuPIC. I would love to learn faster
by working with other interested people and bouncing ideas off each other.
Let me know if you'd like to chat!

Thank you for your time, and your answers to my questions,
Chetan
_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
