I am not following your logic here. A column in the SP needs to be able to make connections to a large number of input bits, and then learn to make connections to a small subset of them. So I don't understand the 3x3 comment.

There are several reasons we might want to use a spatial pooler:

1) The SP converts the dimension (# of bits) of an input into the dimension (# of columns) of the SP. It does this dimension change in an elegant way that always does a pretty good job.

2) The SP converts an input of any sparsity into a relatively fixed sparsity (so the TP can work well). The percentage of active input bits can vary (all being somewhat sparse), and the SP will make it fixed.

3) The SP learns which bits in the input are useful for spatial correlations. It forms connections to these bits, and bits that don't correlate are not used. In general the SP forms columns that represent commonly seen patterns in the input, and this biases the representations passed to the TP. This comes at the expense of rarely seen patterns.

All three reasons are valid for using the SP with CEPT's word SDRs. However, if we made the number of columns match the number of input bits, and we could force the word SDRs to each have the same number of active bits, then we could at least try skipping the SP. But it may not be worth it.

Jeff
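To make reasons 1) and 2) above concrete, here is a minimal numpy sketch of a spatial-pooler-like step. It is a toy with illustrative sizes and names, not the actual NuPIC SpatialPooler (which also learns permanences, boosts, and more): each column samples a large random potential pool, and global inhibition keeps a fixed number of winners, so inputs of varying sparsity come out at a fixed output sparsity.

import numpy as np

rng = np.random.default_rng(42)

n_inputs = 16384   # e.g. a 128x128 retina fingerprint, flattened
n_columns = 2048   # the standard region size mentioned in the thread
n_active = 40      # ~2% of columns win, whatever the input sparsity

# Each column samples a large random subset of the input (the wide
# potential pool from which learning would later keep a small subset).
potential = rng.integers(0, 2, size=(n_columns, n_inputs), dtype=np.int8)

def pool(input_bits):
    overlaps = potential @ input_bits            # active bits seen per column
    winners = np.argsort(overlaps)[-n_active:]   # global inhibition: top-k
    output = np.zeros(n_columns, dtype=np.int8)
    output[winners] = 1
    return output

# Inputs of different sparsity map to the same output sparsity:
for pct in (0.02, 0.05):
    x = (rng.random(n_inputs) < pct).astype(np.int32)
    print(x.sum(), "input bits ->", pool(x).sum(), "active columns")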
From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
Sent: Monday, August 26, 2013 11:39 AM
To: NuPIC general mailing list.
Subject: Re: [nupic-dev] HTM in Natural Language Processing

Jeff,

I'm still not completely convinced that skipping the SP is a good thing to do. It is true that when you feed scalars into the system the SP acts like an SDRizer, but in the case of the text retina we already get SDRs in the first place. I believe that in this case the SP learns another aspect of the data, namely the semantic topology of the input pattern.

This leads me to a scheme where each column gets a field of, let's say, 9 input bits arranged as a 3x3 grid. Depending on the amount of memory one can spend, these 3x3 fields could be fed in a non-overlapping mode. This would mean that the 128x128 sensor bits need an array of 43x43 = 1849 columns. If we decided to overlap the 3x3 fields by one bit, the 128x128 sensor array would be mapped to 64x64 = 4096 columns.

Francisco
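Francisco's column counts follow from ceiling division of the sensor side by the stride: stride 3 for non-overlapping fields, stride 2 when neighbouring fields overlap by one bit. A quick sketch of the arithmetic (assuming edge fields are padded, since 128 is not an exact multiple of either stride):

import math

# Tile a 128x128 sensor with 3x3 receptive fields. With stride s, the
# overlap between neighbouring fields is 3 - s, and one side needs
# ceil(128 / s) fields, the last one padded at the border.
def side_count(sensor_side, stride):
    return math.ceil(sensor_side / stride)

for stride, label in ((3, "non-overlapping"), (2, "overlap by one bit")):
    n = side_count(128, stride)
    print(label + ":", n, "x", n, "=", n * n)
# non-overlapping: 43 x 43 = 1849
# overlap by one bit: 64 x 64 = 4096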
On 26.08.2013, at 20:10, Jeff Hawkins wrote:

I am sold on the kids' story idea. I looked at the link below and there is a lot of metadata in this file. It would have to be removed before feeding to the CLA.

My assumption is that we would need a CLA with more columns than the standard 2048. How many bits are in your word fingerprints? Could we make each bit a column and skip the SP?

Jeff

From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
Sent: Monday, August 26, 2013 3:50 AM
To: NuPIC general mailing list.
Subject: Re: [nupic-dev] HTM in Natural Language Processing

Ian,

I also thought about something from the Gutenberg repository, but I think we should start with something from the Kids Shelf. There are several reasons, in my opinion:

- We start experimentation with a full bag of unknown parameters, so keeping the test material simple would allow us to detect the important ones sooner. And it is quite some work to create a reliable evaluation framework, so the size of the data set makes a difference.

- Keeping the text simple and short substantially reduces the overall vocabulary. If we want people to also evaluate offline, matching fingerprints can become a lengthy process without an efficient similarity engine.

- Another reason is the fact that we don't know how much information a given set of columns (like the 2048 typically used) can absorb. In other words: what is the optimal ratio between the first layer of a text-HTM and the amount of text?

- Lastly, I believe that the sequence in which text is presented to the CLA matters. After all, when humans learn information by reading, they also go from simple to complex language. The amount of new vocabulary during training should be relatively stable (the actual amount would probably be linked to the ratio in my previous argument). So we should build continuously more complex training data sets, finally ending up with "true" books like the ones you listed.

To start, I would suggest something like:

A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by Children
http://www.gutenberg.org/ebooks/7841

But there might still be better ones.

Francisco

On 25.08.2013, at 23:05, Ian Danforth wrote:

I will make 3 suggestions. All are out of copyright, well known, uncontroversial, and still taught in schools (at least in the US).

1. Robinson Crusoe - Daniel Defoe
http://www.gutenberg.org/ebooks/521

2. Great Expectations - Charles Dickens
http://www.gutenberg.org/ebooks/1400

3. The Time Machine - H.G. Wells
http://www.gutenberg.org/ebooks/35

Ian

On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]> wrote:

For those who don't want to use the API, and for evaluation purposes, I would propose that we choose some reference text and I convert it into a sequence of SDRs. This file could be used for training. I would also generate a list of all words contained in the text, together with their SDRs, to be used as a conversion table.

As a simple test measure we could feed a sequence of SDRs into a trained network and see if the HTM makes the right prediction about the following word(s). The last file to produce for a complete framework would be a list of, let's say, 100 word sequences with their correct continuations. The word sequences could be, for example, the beginnings of phrases with more than n words (n being the number of steps that the CLA can predict ahead).

This could be the beginning of a measuring set-up that allows us to compare different CLA implementation flavors.

Any suggestions for a text to choose?

Francisco
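Such a measuring set-up might look like the sketch below. Everything here is a hypothetical stand-in: SDRs are represented as sets of active bit indices, and the model interface (reset/feed/predict) is not an existing NuPIC API, just the shape a CLA wrapper would need for this test.

# Hypothetical next-word prediction harness for comparing CLA flavors.
def evaluate(model, test_cases, word_to_sdr):
    """test_cases: list of (prefix_words, correct_next_word) pairs.
    word_to_sdr: the conversion table, word -> set of active bit indices.
    model: wrapper exposing reset(), feed(sdr), and predict() -> set.
    """
    hits = 0
    for prefix, target in test_cases:
        model.reset()                      # start a fresh sequence
        for word in prefix:
            model.feed(word_to_sdr[word])  # feed the SDR of each prefix word
        predicted = model.predict()        # SDR predicted for the next step
        target_sdr = word_to_sdr[target]
        # count a hit when the prediction covers most of the target's bits
        if len(predicted & target_sdr) >= 0.5 * len(target_sdr):
            hits += 1
    return hits / len(test_cases)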
On 24.08.2013, at 17:12, Matthew Taylor wrote:

Very cool, Francisco. Here is where you can get CEPT API credentials: https://cept.3scale.net/signup

---------
Matt Taylor
OS Community Flag-Bearer
Numenta

On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:

Just a short post scriptum: the public version of our API doesn't actually contain the generic conversion function. But if people from the HTM community want to experiment, just click the "Request for Beta-Program" button and I will upgrade your accounts manually.

Francisco

On 24.08.2013, at 01:59, Francisco Webber wrote:

> Jeff,
> I thought about this already.
> We have a REST API where you can send a word in and get the SDR back, and vice versa.
> I invite all who want to experiment to try it out. You just need to get credentials at our website: www.cept.at <http://www.cept.at/>.
>
> In the mid-term it would be cool to create some sort of evaluation set that could be used to measure progress while improving the CLA.
>
> We are continuously improving our Retina, but the version that is currently online already works pretty well.
>
> I hope that helps.
>
> Francisco
>
> On 24.08.2013, at 01:46, Jeff Hawkins wrote:
>
>> Francisco,
>> Your work is very cool. Do you think it would be possible to make available your word SDRs (or a sufficient subset of them) for experimentation? I imagine there would be interest in the NuPIC community in training a CLA on text using your word SDRs. You might get some useful results more quickly. You could do this under a research-only license or something like that.
>> Jeff
>>
>> -----Original Message-----
>> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>> Sent: Wednesday, August 21, 2013 1:01 PM
>> To: NuPIC general mailing list.
>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>
>> Hello,
>> I am one of the founders of CEPT Systems and the lead researcher on our retina algorithm.
>>
>> We have developed a method to represent words by a bitmap pattern capturing most of their "lexical semantics" (a text sensor). Our word-SDRs fulfill all the requirements for "good" HTM input data:
>>
>> - Words with similar meaning "look" similar.
>> - If you drop random bits in the representation, the semantics remain intact.
>> - Only a small number (up to 5%) of bits are set in a word-SDR.
>> - Every bit in the representation corresponds to a specific semantic feature of the language used.
>> - The Retina (a sensory organ for an HTM) can be trained on any language.
>> - The Retina training process is fully unsupervised.
>>
>> We have found that the word-SDR by itself (without using any HTM yet) can improve many NLP problems that are only poorly solved using traditional statistical approaches. We use the SDRs to:
>>
>> - Create fingerprints of text documents, which allows us to compare them for semantic similarity using simple (euclidean) similarity measures.
>> - Automatically detect polysemy and disambiguate multiple meanings.
>> - Characterize any text with context terms for automatic search-engine query expansion.
>>
>> We hope to successfully link up our Retina to an HTM network to go beyond lexical semantics into the field of "grammatical semantics". This would hopefully lead to improved abstracting, conversation, question-answering, and translation systems.
>>
>> Our correct web address is www.cept.at <http://www.cept.at/> (no kangaroos in Vienna ;-)
>>
>> I am interested in any form of cooperation to apply HTM technology to text.
>>
>> Francisco
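The fingerprint comparison described above can be sketched with toy SDRs. The vectors here are made up for illustration (real fingerprints would come from the CEPT Retina API): overlap counts shared semantic features, and euclidean distance is the "simple" measure mentioned.

import numpy as np

def similarity(a, b):
    # a, b: equal-length binary fingerprints (word or document SDRs)
    overlap = int(np.sum(a & b))              # shared semantic features
    distance = float(np.linalg.norm(a - b))   # simple euclidean measure
    return overlap, distance

# Two toy word fingerprints on a 128x128 = 16384-bit retina, ~2% active;
# "similar" words are simulated by sharing most of their active bits.
n = 16384
rng = np.random.default_rng(1)
base = rng.choice(n, size=300, replace=False)
cat = np.zeros(n, dtype=np.int32)
cat[base] = 1
dog = np.zeros(n, dtype=np.int32)
dog[base[:250]] = 1                              # shares 250 of cat's bits
dog[rng.choice(n, size=50, replace=False)] = 1   # plus ~50 bits of its own

print(similarity(cat, dog))  # high overlap, small distance: close in meaning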
>> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
>>
>>> Hello.
>>>
>>> Like many of you here, I am pretty new to HTM technology.
>>>
>>> I am a researcher in Brazil and I am going to start my PhD program soon. My field of interest is NLP and the extraction of knowledge from text. I am thinking of using the ideas behind the Memory Prediction Framework to investigate semantic information retrieval from the Web and answering questions in natural language. I intend to use the HTM implementation as a base to do this.
>>>
>>> I would appreciate it a lot if someone could answer some questions:
>>>
>>> - Is there any research related to HTM and NLP? Could you point me to it?
>>>
>>> - Is HTM well suited to this problem? Could it learn, without supervision, the grammar of a language, or just help in some aspects such as Named Entity Recognition?
>>>
>>> Regards,
>>>
>>> Christian
_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
