I've added some work to my NuPIC / NLP repo that does POS predictions:

https://github.com/rhyolight/nupic_nlp#parts-of-speech

This experiment does not require the CEPT API, so anyone should be
able to run it just by checking it out and installing. It parses a
given corpus, extracts the part-of-speech tag for each word in each
sentence, and uses a category encoder to feed the POS into NuPIC,
predicting the next POS.
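In case it helps to see the shape of that step, here's a rough sketch
of collapsing tags into the coarse categories you'll see in the output
below. To be clear, this is my own illustration, not code from the
repo: the repo uses NLTK's tagger, and the `TAG_MAP` /
`simplify_tag` names here are hypothetical.

```python
# Hypothetical sketch: collapse Penn Treebank tags into the coarse
# POS categories shown in the experiment output. Not the repo's code.

TAG_MAP = {
    "DT": "determiner",
    "NN": "noun", "NNS": "noun",
    "NNP": "proper noun", "NNPS": "proper noun",
    "PRP": "pronoun", "PRP$": "pronoun",
    "VBD": "past tense",
    "RB": "adverb",
    "JJ": "adjective",
    "IN": "preposition",
    ".": ".",
}

def simplify_tag(treebank_tag):
    """Map a Penn Treebank tag to one of the coarse categories above."""
    return TAG_MAP.get(treebank_tag, "other")

# Each (word, tag) pair becomes a category string that a category
# encoder can turn into a bit array for NuPIC.
tagged = [("All", "DT"), ("the", "DT"), ("gods", "NNS"), ("felt", "VBD")]
print([simplify_tag(t) for w, t in tagged])
# ['determiner', 'determiner', 'noun', 'past tense']
```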

Here is some example output:

$ ./run_pos_experiment.py -t 06_how_thor_got_the_hammer.txt
...
            All           determiner              pronoun
            the           determiner                 noun
           gods                 noun                 noun
           felt           past tense                    .
           very               adverb          preposition
          sorry            adjective          proper noun
            for          preposition                 noun
         little            adjective              pronoun
           Brok          proper noun                 noun
              .                    .           past tense
           They              pronoun              pronoun
        thought           past tense           past tense
           Loki          proper noun              pronoun
              '                                past tense
              s                 noun                 noun
         things                 noun                    .
           were           past tense                    .
           fine                 noun          preposition
              .                    .                    .
...

Column 1: input words
Column 2: POS
Column 3: the POS NuPIC predicted (on the previous step) for that word

There are some interesting things here. NuPIC commonly predicts a
pronoun as the first word of a new sentence, because that's the most
common sentence-starting word in the corpus. It also always predicts
a noun will follow a determiner, because they usually do.

While NuPIC isn't doing great, it does tend to pick up short POS
phrases, and it is pretty good at predicting the ends of sentences.
But this POS problem is not something I'd expect it to nail, frankly.
It's not something a human can do well either. Each phrase is a tree,
and at any point the phrase could branch in multiple directions.
NuPIC is going to make its best guess, but will likely be wrong most
of the time. A more interesting experiment would be to turn this into
an anomaly experiment. Once it's been trained on some text, incoming
nonsense grammar should trigger high anomaly scores.
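To sketch what I mean: the anomaly score is essentially the fraction
of the current input that the model failed to predict, so grammar the
model has never seen should leave most of the input unpredicted. A
toy version (my own illustration of the idea, not the CLA internals):

```python
def anomaly_score(active, predicted):
    """Fraction of currently active columns that were not predicted
    on the previous step (0.0 = fully expected, 1.0 = fully novel)."""
    if not active:
        return 0.0
    return len(active - predicted) / float(len(active))

# Toy example: the model predicted columns {1, 2, 3, 4}. Well-formed
# input mostly overlaps the prediction; nonsense input doesn't.
predicted = {1, 2, 3, 4}
print(anomaly_score({1, 2, 3, 10}, predicted))  # 0.25
print(anomaly_score({7, 8, 9, 10}, predicted))  # 1.0
```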

Another thing you might note is that NLTK doesn't tag all the words
properly. Nouns like "bit" are commonly mis-categorized as a noun
instead of a verb in phrases like "the horse bit the dog", and vice
versa. If anyone is experienced with NLTK, I'd be happy to get some
help improving POS tag accuracy.

I don't have time to continue these experiments, but I hope this lays
some of the groundwork for anyone interested in the NLP focus of the
Hackathon. I've added this to our list of NLP challenges on our wiki:

https://github.com/numenta/nupic/wiki/Natural-Language-Processing#challenges
---------
Matt Taylor
OS Community Flag-Bearer
Numenta


On Thu, Oct 3, 2013 at 10:01 AM, Matthew Taylor <[email protected]> wrote:
> Oh by the way, keep in mind that I'm still a python novice.
> Improvements, clarifications, and pull requests are welcome!
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta
>
>
> On Thu, Oct 3, 2013 at 9:59 AM, Matthew Taylor <[email protected]> wrote:
>> I've been putting together some experiments with NLP and CEPT's word
>> SDRs. Thanks to Subutai and Francisco for your help with this.
>>
>> I've got some initial decent results, at least proving that we can
>> take CEPT's SDRs as input for the CLA and get predicted SDRs back out
>> and get the "similar terms" for the SDR from CEPT's API.
>>
>> https://github.com/rhyolight/nupic_nlp
>>
>> The README on that repo is extensive, so if you are interested, please
>> get a CEPT API key[1] and try it out with your own word associations.
>> Here is an example (from the README):
>>
>>     $ ./run_association_experiment.py resources/animals.txt
>> resources/vegetables.txt -p 100 -t 1000
>>     Prediction output for 1000 pairs of terms
>>
>>     #COUNT        TERM ONE        TERM TWO | TERM TWO PREDICTION
>>     --------------------------------------------------------------------
>>     #  100          salmon          endive |              lentil
>>     #  101       crocodile          borage |
>>     #  102            wolf        turmeric |            amaranth
>>     #  103         termite       chickweed |
>>     #  104           quail            poke |
>>     #  105      woodpecker         shallot |
>>     #  106         echidna           caper |              tomato
>>     #  107         panther            guar |
>>     #  108             ape       tomatillo |       chrysanthemum
>>     #  109             bee         cabbage |
>>     #  110        seahorse          sorrel |
>>     #  111           camel       tomatillo |          lemongrass
>>     #  112             rat          chives |
>>     #  113            crab             yam |              turnip
>>
>> This script takes a random term from the first file and a random term
>> from the second. It converts each term to an SDR through the CEPT API
>> and feeds term #1 and term #2 into NuPIC, bypassing the spatial pooler
>> and sending it right into the TP (as described in the hello_tp
>> example[2]). The next prediction after feeding in term #1 is preserved
>> and printed to the console. Then it resets the TP so that it can only
>> learn that simple one->two relationship. In the sample above, NuPIC
>> should only be predicting plants or vegetables, given that the
>> association I'm training it on is "animal" --> "vegetable".
>>
>> This trivial example seems to be working rather well, although NuPIC
>> doesn't always have a valid SDR prediction. The predictions it does
>> create almost always seem to be some sort of plant. Even more
>> interesting is that sometimes NuPIC predicts SDRs that resolve to
>> words outside the set of input terms.
>>
>> Happy hacking!
>> ---------
>> Matt Taylor
>> OS Community Flag-Bearer
>> Numenta
>>
>> [1] https://cept.3scale.net/signup (YOU MUST upgrade your account to
>> use the API endpoints this project requires, email [email protected]
>> and tell him you're working on NuPIC NLP tasks and he'll upgrade you.)
>> [2] https://github.com/numenta/nupic/blob/master/examples/tp/hello_tp.py

_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org