With all this talk about Natural Language Processing and the hackathon, I figured this is a good time to share a little project my friend and I recently started that uses NuPIC. We call it Linguist, and our goal is an AI that can read text unsupervised (from Wikipedia and the rest of the internet), build a model of language, and provide better autocorrect / autopredict for mobile keyboards (as its first useful application).
So far, we've tried feeding characters, one at a time, to the CLA, each one encoded as a category. We've watched it learn sentences the way the melody-learning AI from the last hackathon learned notes. You can download what we have so far and try it yourself: https://github.com/chetan51/linguist It might also be a decent platform for experimenting with NLP tasks at the upcoming hackathon! We have a couple of interesting ideas, and a bunch of questions.

Some ideas: We want to train it on public text from the internet to build a global model, then install it on a user's phone and have it keep learning (possibly with higher weight) from the user's own text messages and emails. This would take an already intelligent model and personalize it to the user's own writing style and vocabulary. We're also thinking of using anomaly detection to fix spelling mistakes, and probability thresholds to suggest the rest of the word, phrase, or sentence without being annoying. We're hoping the CLA will prove to be a good algorithm for this application, and we're very curious to see how well it does.

Some questions: While playing with it, we noticed that it learns sequences pretty quickly, but patterns very slowly. When we repeated a short sentence many times, after a couple of repetitions it could predict the rest of the sentence fairly accurately from any position in the sentence. But when we fed it long text, such as novels from Project Gutenberg, its predictions were almost totally incoherent. Could this be because the CLA is currently implemented as just a single region, without hierarchies? For that matter, how well can a single region predict complex patterns like those in language, beyond simple character transitions? Do we need hierarchy support before we'll see any decent performance on this task?

We're also not totally clear on why a perfect run during the short sentence-repetition exercise described above is sometimes followed by a mistake on the next run. Why exactly, down to the level of neuronal connections, can prediction accuracy go down with an additional repetition of a pattern? Is it because the algorithm is stochastic? We'd love any insight on that :)

Finally, we'd like to invite interested parties to join us in exploring this and related NLP applications of NuPIC. I'd love to learn faster by working with other interested people and bouncing ideas off each other. Let me know if you'd like to chat!

Thank you for your time, and for your answers to my questions,
Chetan
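P.S. For anyone curious what "each character encoded as a category" looks like concretely, here's a rough sketch. To be clear, this is a minimal stand-in written for illustration, not NuPIC's actual CategoryEncoder: the alphabet, the bit counts, and the `model.compute` call at the end are all placeholders we made up.

    # Minimal sketch of a character-as-category encoding. Each character
    # gets its own non-overlapping block of W active bits in a binary
    # vector, so no two categories share any bits.

    import string

    ALPHABET = string.ascii_lowercase + " .,"   # categories we encode
    W = 11                                      # active bits per category
    N = W * len(ALPHABET)                       # total width of the encoding

    def encode_char(char):
        """Return a binary list of length N with W contiguous 1-bits
        in the block assigned to `char`. (Characters outside ALPHABET
        would need their own handling; omitted here for brevity.)"""
        index = ALPHABET.index(char)
        bits = [0] * N
        start = index * W
        for i in range(start, start + W):
            bits[i] = 1
        return bits

    # Feeding a sentence one character at a time, where `model` stands
    # in for a CLA region (hypothetical call, for illustration only):
    # for char in "the quick brown fox":
    #     model.compute(encode_char(char))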
