Hi Ivan,

You're completely correct that we're increasing the amount of information coming in, but that is precisely what we want.
If you just want to treat words as abstract labelled symbols, then we already have methods for that. In that case all you're doing is learning the sequence of objects, without any "understanding" of *why* the sequence is as it is. You could just as easily be learning sequences of notes (and not learning anything about the rules or "quality" of music), or, as we used to watch on BBC's Generation Game, a sequence of consumer goods passing by on a conveyor belt. Each successive object gives the CLA no information about what the next object should be.

When you add all this semantic information using the CEPT data, you're learning what *kinds of words* fit together in a sequence. Humans learn language based on categorisation, inference and generalisation, so the stream of words must contain structural information which allows this to happen. In a simple English sentence, for example, the first object is usually a noun (the subject of the sentence), and the second is usually a verb which agrees in number with the subject. If the verb is transitive, it is usually followed by another noun, the object of the sentence. Variations on this pattern exist, but each one operates under its own very strict set of rules, again providing a structure which is contained in the (generalised) grammar of the language, the details of which are contained in the experienced streams of words.

As we learn English (or any language), we learn the semantic category and grammatical settings for each word, along with its meaning and sound. We also learn the rules for fitting these things together in grammatically correct and semantically sensible ways. We can generalise and innovate within the constraints of these rules, and we can detect and error-correct when the rules are broken. The language itself is learned in this holistic, holographic way, extracted only from what we hear and learn to think and produce (there is likely some hardwired "Universal Grammar" in our brains, probably implemented in the way that various regions are wired together). Pinker's *The Language Instinct* is a great survey of these ideas.

So, this exercise is a first set of steps in exploring how we can use the CLA to interface with this semantic structure of natural language. We're going to see how well a single-layer, single-region, small CLA deals with the structural and semantic information about sentences which the CEPT encoding provides.

By the way, this is also a test of the usefulness of the CEPT concept and implementation, which itself was motivated by Francisco's awareness of the work Jeff et al. have been doing on the CLA. We'll be seeing how well the CEPT data embodies the kind of semantic information we believe is needed for use in NLP. CEPT have already begun adapting their software in response to feedback from the NuPIC/Grok side, and they're clearly very interested in using our work and the hackathon to improve the CEPT SDR system and its power as an NLP encoding scheme.

In terms of capacity, there's not much to worry about, IMHO. The CEPT SDRs are 16k bitmaps, and a 2k CLA should be able to deal with that. The input here is really an emulation of a (hypothetical) high-level language-processing region being fed with a lower- (but still high-) level data stream in the language of the neocortex: an SDR. The 16k of data is really a set of sub-SDRs, each of which encodes a different semantic aspect of the underlying word.
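To put some rough numbers on this, here's a toy Python sketch. To be clear, the sizes and the fake "word" construction below are my own illustrative assumptions, not the actual CEPT encoding or any NuPIC API. It builds two 16,384-bit word SDRs which share some semantic bits, shows that their intersection is exactly the semantic-similarity signal Ivan is worried about, and then counts how many distinct ~2%-sparse codes a 2,048-column layer can represent:

    import math
    import random

    import numpy as np

    # Toy stand-ins for CEPT word fingerprints: 16,384-bit SDRs.
    # Real CEPT SDRs are derived from text corpora; here we just fake
    # two words which share part of their meaning, to show that
    # semantic similarity appears directly as bit overlap.
    SDR_SIZE = 16384        # CEPT fingerprint width (16k bitmap)
    ACTIVE_BITS = 328       # ~2% sparsity - my assumption, not CEPT's figure

    rng = random.Random(42)
    shared_bits = rng.sample(range(SDR_SIZE), 120)  # bits for the shared meaning
    shared_set = set(shared_bits)
    free_bits = [i for i in range(SDR_SIZE) if i not in shared_set]

    def toy_word_sdr():
        """A fake word SDR: shared semantic bits plus word-specific bits."""
        specific = rng.sample(free_bits, ACTIVE_BITS - len(shared_bits))
        sdr = np.zeros(SDR_SIZE, dtype=np.uint8)
        sdr[shared_bits] = 1
        sdr[specific] = 1
        return sdr

    dog, wolf = toy_word_sdr(), toy_word_sdr()

    # The "words intersection" Ivan mentions is exactly this overlap,
    # and it's what lets the CLA generalise over kinds of words:
    overlap = int(np.count_nonzero(dog & wolf))
    print(f"overlap: {overlap} of {ACTIVE_BITS} active bits")

    # Capacity of a 2k CLA layer: with 2,048 columns and ~2% (about 40)
    # active at once, the number of distinct codes is C(2048, 40):
    print(f"distinct codes in a 2k layer: {math.comb(2048, 40):.3e}")

That last count comes out around 10^84, so the code space of even a small CLA dwarfs any vocabulary we could feed it; the practical limit is the learning, not the representation.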
This composition of sub-SDRs is similar (structurally) to the hierarchical organisation which we know exists in the brain for handling language, and indeed for most of the corticocortical hierarchy. We'll find out how well this works when we start the exploration, and we can easily increase the size of the CLA if we want to try that. Personally, I think that a 2k CLA has an enormous capacity - much higher than you would think - because of the way it efficiently deploys connections and predictor cells based solely on the actual content and structure in the data.

Regards,
Fergal Byrne

On Sat, Sep 14, 2013 at 7:52 AM, Ivan Sytenko <[email protected]> wrote:

> Jeff,
>
> I see a methodological problem here.
>
> CEPT representation of the word increases the size of the code with
> semantic information. This increases the size of the text presentation.
>
> The semantic information is not required in the task of remembering. This
> information will overload your system.
>
> On the other hand, you increase the words intersection. This strongly
> complicates the semantic analysis too.
>
> Do not get me wrong. I want to save people time and computers time only.
>
> Please do not explain if I'm wrong.
>
> Ivan

--
Fergal Byrne

ExamSupport/StudyHub
[email protected]
http://www.examsupport.ie

Dublin in Bits
[email protected]
http://www.inbits.com
+353 83 4214179

Formerly of Adnet
[email protected]
http://www.adnet.ie
