I've added some work to my NuPIC / NLP repo that does POS predictions:
https://github.com/rhyolight/nupic_nlp#parts-of-speech
This experiment does not require the CEPT API, so anyone should be
able to run it just by checking it out and installing. It parses a
given corpus, decodes all the parts of speech tags for each sentence,
and uses a category encoder to pass the POS into NuPIC, predicting the
next POS.
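The category-encoding step can be sketched roughly like this (a minimal illustration of my own, not the repo's actual code; it just mimics the spirit of NuPIC's category encoder, where each category claims a fixed block of active bits):

```python
# Toy category encoder sketch (mine, not NuPIC's implementation): each
# POS tag owns a contiguous block of active bits in a binary vector, so
# distinct tags produce non-overlapping encodings.

def make_category_encoder(categories, bits_per_category=3):
    """Return a function mapping a known category to a binary list."""
    width = len(categories) * bits_per_category
    index = {cat: i for i, cat in enumerate(categories)}

    def encode(category):
        output = [0] * width
        start = index[category] * bits_per_category
        for b in range(start, start + bits_per_category):
            output[b] = 1
        return output

    return encode

pos_tags = ["noun", "verb", "determiner", "adjective", "pronoun"]
encode = make_category_encoder(pos_tags)
print(encode("noun"))        # bits 0-2 active
print(encode("determiner"))  # bits 6-8 active
```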
Here is some example output:
$ ./run_pos_experiment.py -t 06_how_thor_got_the_hammer.txt
...
All determiner pronoun
the determiner noun
gods noun noun
felt past tense .
very adverb preposition
sorry adjective proper noun
for preposition noun
little adjective pronoun
Brok proper noun noun
. . past tense
They pronoun pronoun
thought past tense past tense
Loki proper noun pronoun
' past tense
s noun noun
things noun .
were past tense .
fine noun preposition
. . .
...
Column 1: input words
Column 2: POS
Column 3: predicted POS for the same word
There are some interesting things here. NuPIC commonly predicts a
pronoun as the first word of a new sentence, because that's the most
common sentence-opening word in the corpus. It also always
predicts a noun will follow a determiner, because they usually do.
While NuPIC isn't doing great, it does tend to pick up small POS
phrases, and is pretty good at predicting the ends of sentences. But
this POS problem is not something I'd expect it to nail, frankly. It's
not something a human can do well on either. Each phrase is a tree
that, at any point, could branch in multiple directions.
NuPIC is going to make its best guess, but will likely be wrong most
of the time. A more interesting experiment would be to turn this into
an anomaly experiment. Once it's been trained on some text, incoming
nonsense grammar should trigger high anomaly scores.
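The anomaly idea boils down to scoring how much of the current input was unexpected. Here is a toy sketch of mine (not NuPIC's code) of that measure: the fraction of active bits that the previous step failed to predict.

```python
# Toy anomaly score sketch (mine, not NuPIC's implementation): the
# fraction of currently active bits that were not among the bits
# predicted on the previous step.

def anomaly_score(active_bits, predicted_bits):
    active = set(active_bits)
    if not active:
        return 0.0
    unexpected = active - set(predicted_bits)
    return len(unexpected) / float(len(active))

# Grammatical input matching the predictions -> low score
print(anomaly_score({1, 2, 3}, {1, 2, 3, 4}))  # 0.0
# Nonsense grammar lighting up unpredicted bits -> high score
print(anomaly_score({7, 8, 9}, {1, 2, 3}))     # 1.0
```

Trained on well-formed text, a stream of scrambled grammar should keep this score high.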
Another thing you might note is that NLTK doesn't tag all the words
properly. Words like "bit" are commonly mis-tagged as a noun
instead of a verb in phrases like "the horse bit the dog", and vice
versa. If anyone is experienced with NLTK, I'd be happy to get some
help improving POS tag accuracy.
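To illustrate the ambiguity (a toy sketch of mine, not how NLTK's taggers actually work): a context-free unigram tagger always assigns a word its most frequent tag, so "bit" stays a noun no matter the sentence, while even a crude rule that looks at the previous tag can recover the verb reading here.

```python
# Toy tagger sketch (mine, not NLTK internals) showing why context
# matters for ambiguous words like "bit".

# Most-frequent-tag lookup; assume "bit" is more often tagged as a noun.
UNIGRAM_TAGS = {"the": "DT", "horse": "NN", "bit": "NN", "dog": "NN"}

def unigram_tag(words):
    """Context-free tagging: every word gets its most frequent tag."""
    return [(w, UNIGRAM_TAGS.get(w, "NN")) for w in words]

def bigram_tag(words):
    """Crude context rule: an NN right after another NN is likely a
    past-tense verb in a simple subject-verb-object sentence."""
    fixed, prev = [], None
    for word, tag in unigram_tag(words):
        if tag == "NN" and prev == "NN":
            tag = "VBD"
        fixed.append((word, tag))
        prev = tag
    return fixed

sentence = ["the", "horse", "bit", "the", "dog"]
print(unigram_tag(sentence))  # "bit" mis-tagged as NN
print(bigram_tag(sentence))   # "bit" corrected to VBD
```

NLTK's trained n-gram and sequence taggers use the same basic idea with real corpus statistics, which is why tag accuracy depends so much on the training data.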
I don't have time to continue these experiments, but I hope this lays
some of the groundwork for anyone interested in the NLP focus of the
Hackathon. I've added this to our list of NLP challenges on our wiki:
https://github.com/numenta/nupic/wiki/Natural-Language-Processing#challenges
---------
Matt Taylor
OS Community Flag-Bearer
Numenta
On Thu, Oct 3, 2013 at 10:01 AM, Matthew Taylor <[email protected]> wrote:
> Oh by the way, keep in mind that I'm still a python novice.
> Improvements, clarifications, and pull requests are welcome!
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta
>
>
> On Thu, Oct 3, 2013 at 9:59 AM, Matthew Taylor <[email protected]> wrote:
>> I've been putting together some experiments with NLP and CEPT's word
>> SDRs. Thanks to Subutai and Francisco for your help with this.
>>
>> I've got some initial decent results, at least proving that we can
>> take CEPT's SDRs as input for the CLA and get predicted SDRs back out
>> and get the "similar terms" for the SDR from CEPT's API.
>>
>> https://github.com/rhyolight/nupic_nlp
>>
>> The README on that repo is extensive, so if you are interested, please
>> get a CEPT API key[1] and try it out with your own word associations.
>> Here is an example (from the README):
>>
>> $ ./run_association_experiment.py resources/animals.txt
>> resources/vegetables.txt -p 100 -t 1000
>> Prediction output for 1000 pairs of terms
>>
>> #COUNT TERM ONE TERM TWO | TERM TWO PREDICTION
>> --------------------------------------------------------------------
>> # 100 salmon endive | lentil
>> # 101 crocodile borage |
>> # 102 wolf turmeric | amaranth
>> # 103 termite chickweed |
>> # 104 quail poke |
>> # 105 woodpecker shallot |
>> # 106 echidna caper | tomato
>> # 107 panther guar |
>> # 108 ape tomatillo | chrysanthemum
>> # 109 bee cabbage |
>> # 110 seahorse sorrel |
>> # 111 camel tomatillo | lemongrass
>> # 112 rat chives |
>> # 113 crab yam | turnip
>>
>> This script takes a random term from the first file and a random term
>> from the second. It converts each term to an SDR through the CEPT API
>> and feeds term #1 and term #2 into NuPIC, bypassing the spatial pooler
>> and sending it right into the TP (as described in the hello_tp
>> example[2]). The next prediction after feeding in term #1 is preserved
>> and printed to the console. Then it resets the TP so that it can only
>> learn that simple one->two relationship. In the sample above, NuPIC
>> should only be predicting plants or vegetables, given that the
>> association I'm training it on is "animal" --> "vegetable".
>>
>> This trivial example seems to be working rather well, although NuPIC
>> doesn't always have a valid SDR prediction. The predictions it does
>> create almost always seem to be some sort of plant. Even more
>> interesting is that sometimes NuPIC predicts SDRs that resolve to
>> words outside the range of the input values.
>>
>> Happy hacking!
>> ---------
>> Matt Taylor
>> OS Community Flag-Bearer
>> Numenta
>>
>> [1] https://cept.3scale.net/signup (YOU MUST upgrade your account to
>> use the API endpoints this project requires, email [email protected]
>> and tell him you're working on NuPIC NLP tasks and he'll upgrade you.)
>> [2] https://github.com/numenta/nupic/blob/master/examples/tp/hello_tp.py
_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org