Hi Eric,
On Tue, Mar 11, 2014 at 2:34 PM, Eric Collins <[email protected]> wrote:

> I don't know about the rest of you, but I don't think that I have read
> text as strictly left to right from one letter to the next since I was a
> child. My intuition is that we look at the first couple of letters (and
> subconsciously take in the length of the word), and then we begin to make
> predictions about what the word might be. I've noticed this in my son as
> he was learning to read. He would see the first few letters and then try
> to guess the rest of the word based on what he thought the first part of
> the word sounded like. I had to admonish him several times to read the
> whole word and not just guess. But my intuition tells me that it may not
> be necessary to read every single letter in a word.

You are definitely right, as the popular meme goes:

*"Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in
waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the
frist and lsat ltteers be at the rghit pclae. The rset can be a toatl mses
and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid
deos not raed ervey lteter by istlef, but the wrod as a wlohe."*

So your idea (also known in NLP/AI) is that NLP should be done at the level
of words (the word being the smallest, atomic part), right? I think this
has been discussed (slightly) on the ML before. There seem to be at least
two approaches:

- A smart way to represent words -- a "word encoder"? It could be done as
  in the example above: fix the first and last letter and use a set
  (unordered) for the rest. A possible problem is the varying length of
  words. Another representation, slightly different in meaning, is the
  CEPT matrix (an ontology of similar/antonym terms, keywords, etc.).
- The other approach would use hierarchy: one CLA takes input letter by
  letter, constructs temporal sequences (predictions), and higher CLAs
  build on these.
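The first approach (the "word encoder") could be sketched very roughly as
follows. This is only a toy illustration, not NuPIC code; the function name
and the choice of a frozenset for the middle letters are my own:

```python
from collections import Counter

def encode_word(word):
    """Toy 'scrambled-word' code: the first letter, the last letter, and
    the unordered multiset of the letters in between. By construction it
    is invariant to scrambling the middle of the word, as in the meme."""
    return (word[0], word[-1], frozenset(Counter(word[1:-1]).items()))

# The scrambled and correct spellings collapse to the same code:
print(encode_word("cmabrigde") == encode_word("cambridge"))  # True
# ...while words differing in the first letter do not:
print(encode_word("cow") == encode_word("how"))  # False
```

Note that the varying-length problem mentioned above shows up here too: the
middle multiset has no fixed size, so turning this tuple into a fixed-width
SDR would still need a separate encoding step.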
This way is nicer in that it constantly improves the predictions, uses
hierarchies, and does not suffer from the implementation details (of
varying length, etc.). However, it's not "natural" in the sense that it
processes characters sequentially. It's fair to say, though, that
letter-by-letter predictions also have their uses, e.g. in spell-checking
and text generation (see Hinton's deep NNs for text generation) -- there
you have an input space of A-Z (= 26), in contrast with the size of an
active vocabulary (~5k words?). E.g. "I eat ba_ _ _ _ ."

> We are highly efficient pattern recognizers. I think our eyes
> instinctively saccade to areas of the pattern that will help us
> disambiguate what we are seeing. In the case of reading text, we might
> see the first two or three letters and begin making predictions based on
> them (and probably the approximate length of the word as roughly judged
> by our peripheral vision). Our eyes then saccade to the location in the
> text that is (statistically speaking) the most likely to reduce the
> number of potential patterns that have letters at the locations we have
> already scanned with our fovea. (I think something similar to this
> allows us to achieve some degree of spatial invariance when reading at
> an angle or upside down.) We do this until we have a high enough
> confidence in our prediction to move on. (I'm sure context is also used
> in this process as well.)
>
> In the past, I have been trying to think about this problem in terms of
> Bayesian analysis, but more recently, my thoughts have been shifting
> more towards the CLA/HTM and sensor-motor integration. I think there is
> a tremendous amount of potential to perform pattern recognition
> utilizing both the spatial and temporal poolers through the use of
> saccades. But, I will save that discussion for the nupic-theory list if
> anyone is interested.

PS: CC to nupic-theory ... hope you're not mad. First post! ;-)

Cheers,
Mark

> Eric M.
> Collins
>
> On Tue, Mar 11, 2014 at 12:44 AM, Aseem Hegshetye
> <[email protected]> wrote:
>
>> Hi,
>>
>> Matt: that was a great question, because the whole debate depends on
>> understanding the input.
>>
>> Suppose letters are input to the system. Every letter has a predefined
>> representation:
>>
>> C=[1100000]
>> O=[1010000]
>> W=[1001000]
>> H=[1000100]
>>
>> Since all four letters have the 1st bit overlapping, their SDRs are
>> going to have overlapping bits. Suppose their SDRs are:
>>
>> C=[00010000010000100010000000000100001]
>> O=[01000000010000100010000000000100100]
>> W=[00010000010000100010000001000000100]
>> H=[00010010000000100010000000000100001]
>>
>> I am trying to build a hierarchy, so from the temporal pooler I am
>> planning on building higher-level SDRs: the word 'COW' will have an SDR
>> and 'HOW' will have an SDR. Now a very simple question: would COW and
>> HOW have overlapping bits? I can't say they will be semantically
>> similar, because the two words mean different things. But actual
>> semantics is too high-level; in Subutai's experiment, "fox" had a
>> representation that was semantically similar to other things that ate
>> rodents. Do you think these two words, COW and HOW, should have some
>> representational similarities?
>>
>> I discussed this problem with my roommate, who knows nothing about AI,
>> the brain, or programming. He said the two words sound mostly the same.
>> If there were some noise in the surroundings when I heard these words,
>> my prediction could have predicted either of them with equal
>> probability. When two words sound similar, our cochlea generates
>> significantly similar signals for both, and they have similar
>> representations in our low-level brain, but they need not have similar
>> meanings.
>>
>> The two words [HOW] and [COW] are distinct, but since their last two
>> letters are the same, they sound a bit similar -- so should they have a
>> little semantic similarity?
>>
>> Thanks
>>
>> _______________________________________________
>> nupic mailing list
>> [email protected]
>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org

--
Marek Otahal :o)
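As a postscript, Aseem's overlap question can actually be checked
numerically. A minimal sketch, using his bit strings verbatim and a plain
set union as a stand-in for whatever word-level SDR the temporal pooler
would really form:

```python
# Letter SDRs copied from Aseem's mail (35 bits each).
letter_sdr = {
    "C": "00010000010000100010000000000100001",
    "O": "01000000010000100010000000000100100",
    "W": "00010000010000100010000001000000100",
    "H": "00010010000000100010000000000100001",
}

def active_bits(bitstring):
    """Indices of the ON bits in an SDR bit string."""
    return {i for i, b in enumerate(bitstring) if b == "1"}

def word_sdr(word):
    """Naive word-level SDR: the union of the letter SDRs (a toy
    stand-in, not what a temporal pooler would produce)."""
    bits = set()
    for letter in word:
        bits |= active_bits(letter_sdr[letter])
    return bits

cow, how = word_sdr("COW"), word_sdr("HOW")
print(len(cow), len(how), len(cow & how))  # 9 10 9
```

Under this (overly generous) union, COW and HOW share 9 active bits out of
9 and 10 respectively: O and W contribute most of the bits, and even the
C and H codes themselves overlap in 4 of their 6 bits. So at this level the
two words would look almost identical, which is exactly the acoustic-
similarity-without-semantic-similarity situation his roommate described.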
