Sorry I have been a little absent on this list.  I was travelling this week
and I am preparing for OSCON next week, so I can't keep up with all the
conversations.

 

Most image classification systems rely on some form of what we call
"temporal pooling".  (Mike described it well below.)  For example, HMAX, a
vision system out of Poggio's lab at MIT, uses a hard-coded pooling
mechanism: they take their spatial features and hard-code representations
that are active for spatial shifts of the feature.  Hard-coded pooling works
OK for the first level of a vision hierarchy, but it doesn't work in a
general sense.
For example, in audition we need to pool patterns in time that have no
obvious spatial invariance.  We might want to pool successive notes in a
melody and there is no equivalent of spatial invariance for that.
Therefore, a cortical region must learn what patterns to pool over time.

 

We did some vision work prior to the CLA.  Those algorithms did not have a
good temporal model, and we actually used hard-coded pooling à la HMAX.  I
was never happy about this, although it produced OK (but not great) results.

 

When we first created the CLA and were using it for vision experiments we
spent a lot of time making sure it could learn temporal pooling.  The idea
is you first learn a sequence, but this on its own doesn't do any pooling.
To pool you need cells to stay active over a sequence of patterns.  The way
we achieved this is that a cell first learns to predict its activity one
step ahead in time.  Once it has learned to do that, it can learn to predict
its activity two steps ahead, and so on.  Through repeated patterns, a cell
can learn to predict its activity well in advance.  How far in advance
depends on how predictable and varied the sequences are.
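The stepping-stone idea above can be illustrated with a toy sketch (my own simplification, not the actual CLA code): a cell records the one-step transitions it has seen, then chains those one-step predictions to anticipate its activity several steps ahead. A fully predictable repeating sequence lets it predict arbitrarily far in advance.

```python
# Toy sketch: pooling via progressively longer-range prediction.
# This is an illustrative simplification, not the CLA algorithm itself.

from collections import defaultdict

class ToyPoolingCell:
    def __init__(self):
        # pattern -> set of patterns observed immediately after it
        self.transitions = defaultdict(set)

    def learn(self, sequence):
        # Learn one-step-ahead transitions from a training sequence.
        for prev, nxt in zip(sequence, sequence[1:]):
            self.transitions[prev].add(nxt)

    def predict(self, pattern, steps):
        # Chain one-step predictions to look several steps ahead.
        frontier = {pattern}
        for _ in range(steps):
            nxt = set()
            for p in frontier:
                nxt |= self.transitions[p]
            frontier = nxt
        return frontier

cell = ToyPoolingCell()
cell.learn(list("ABCDABCD"))   # a repeating, fully predictable sequence
print(cell.predict("A", 1))    # {'B'}
print(cell.predict("A", 3))    # {'D'} -- predictable well in advance
```

With a less predictable or more varied training sequence, the predicted set grows with each step, which is the sense in which predictability limits how far ahead a cell can usefully pool.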

 

I don't have time to go into all the details now, but as Mike suggests, if
we have only one cell per column then the cell will pool no matter what
direction a pattern is moving.  It can't tell a left moving line from a
right moving line.  Therefore it will produce a cell that responds to a line
no matter where the line is and no matter what direction it is moving.
However, if we have multiple cells per column then it will produce a cell
that responds when a line is moving in a particular direction.  We see both
types of cell in V1 in real brains.  I have a theory (a highly speculative
theory) that Layer 4 cells are like the former and Layer 3 cells are like
the latter.  There are several lines of evidence to suggest this.   In this
case Layer 4 learns pure shift invariance, but Layer 3 learns true
sequences.  BTW, Layer 4 is large in the first couple of levels of cortex
but disappears as you ascend the hierarchy.  My explanation is that as you
ascend the hierarchy, spatial invariance is solved and is no longer needed.
But sequences like melodies, language and actions continue to need the type
of pooling done by Layer 3.

 

We got pooling to work in the CLA but it took a lot of synapses and
therefore memory and computation time.  In the current form of the CLA we
have sequence memory but the pooling part is deactivated.  We don't need
pooling for the types of problems we are applying Grok to.

 

One of the reasons I am hesitant to work on vision problems is that the
temporal pooling requirement is large.  Consider this: the amount of cortex
dedicated to low-level vision (areas V1 and V2) dwarfs the amount of cortex
dedicated to language (Broca's and Wernicke's areas).  Low level vision is
much harder than language.  Amazing.

Jeff

 

From: nupic [mailto:[email protected]] On Behalf Of Scott
Purdy
Sent: Wednesday, July 17, 2013 10:58 AM
To: NuPIC general mailing list.
Subject: Re: [nupic-dev] Training on Handwritten Digit Dataset using CLA

 

I was wrong about that. I don't quite understand it well enough to give a
proper response so I am going to see if Jeff can write it up.

 

The explanation I got was that you can train a temporal model by moving the
letter around the image.  And then when you give it a test image, you expect
it to predict the letter moving in different directions.  The predicted
cells are apparently useful as you move up the hierarchy.  Time acts as a
sort of supervisor for spatial invariants.

 

But like I said, I am going to try to get someone to do a better
explanation.  There was quite a lot of vision work done that would be great
to capture for you guys.

 

On Wed, Jul 17, 2013 at 8:04 AM, Quinn Liu <[email protected]> wrote:

Hi Michael and Scott,
    Thank you very much for your explanations. Michael's explanation implies
that the Temporal Pooler greatly helps with learning spatial invariance from
the training data, which I can see working. 

 

But for question 3, Scott said "No need for TP. It won't help with spatial
representations." I was hoping, Scott, that you could expand on your answer
and explain how you think the SP and TP contribute to spatial invariance
recognition. 

 

Best Regards,

Quinn Liu

 

On Mon, Jul 15, 2013 at 5:07 PM, Michael Ferrier <[email protected]>
wrote:

Hi Quinn, 

 

The older version of HTM would group together the spatial patterns that
would tend to occur in close temporal sequence with one another, and produce
the same output when it saw any of the spatial patterns within a given
group. So, if a network were trained on visual input of digits zig-zagging
through the visual field, then any individual visual feature (for example a
vertical line) would come to be represented by a temporal group that
responds when it is presented with a vertical line at any of many nearby
locations, because in the training data, a vertical line is often seen
moving from one location to another nearby location. In this way it would
learn invariance to position. At the lowest level of the hierarchy it would
learn invariance to position for individual small visual features, and at
higher levels it would learn invariance for more complex and larger
arrangements of features and whole visual objects. Invariance to other
transformations like scale, rotation, etc. could also be learned this way
given the appropriate training data.
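Mike's description of temporal grouping in the older HTM can be sketched roughly like this (my own reconstruction with made-up pattern labels, not the original Numenta code): patterns that frequently occur in adjacent time steps are merged into one group, and the node then emits the same group for any member pattern.

```python
# Rough sketch of temporal grouping: cluster patterns by temporal adjacency.
# An illustrative reconstruction, not the original HTM implementation.

from collections import defaultdict

def temporal_groups(stream, min_count=1):
    # Count how often each pair of patterns appears in adjacent time steps.
    adjacency = defaultdict(int)
    for a, b in zip(stream, stream[1:]):
        if a != b:
            adjacency[frozenset((a, b))] += 1

    # Union-find merge: patterns linked by frequent transitions share a group.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for pair, count in adjacency.items():
        if count >= min_count:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for p in parent:
        groups[find(p)].add(p)
    return list(groups.values())

# A vertical line sweeping across three nearby positions ("v@0" etc. are
# hypothetical labels for "vertical line at position n"):
stream = ["v@0", "v@1", "v@2", "v@1", "v@0", "v@1", "v@2"]
print(temporal_groups(stream))  # one group containing all three positions
```

All three positions end up in a single group, so any of them produces the same output for the next level, which is the positional invariance described above.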

 

Like Scott said, the old version of HTM worked very differently from CLA, but
they both model the same basic principles (the CLA does so much more
flexibly). Using a CLA region with one cell per column, a cell should become
active when given a particular spatial pattern, but should become predictive
when given any pattern that (during training) often occurs close by in
temporal sequence to that spatial pattern. So, if a column's proximal
segment represents the spatial pattern of a vertical line, then that
column's cell should become predictive whenever a vertical line at any
nearby position is presented, because during training a given vertical line
is often followed by another nearby vertical line, since the training set is
made up of animations of the visual objects smoothly zig-zagging around.

 

And because a CLA region sends output from both its active and predictive
cells, from the point of view of the next, higher region in the hierarchy,
that cell is responding invariantly to any of a set of nearby vertical
lines. This corresponds to how 'complex cells' respond in visual cortex.
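As a toy illustration of that output rule (my own sketch, not NuPIC's TP implementation): with one cell per column, a region's output is the union of the active column and the columns it drives into the predictive state, so nearby positions of a line all produce overlapping output.

```python
# Toy sketch: one-cell-per-column region outputting active + predictive
# columns.  Illustrative only; integers stand in for column indices.

from collections import defaultdict

class OneCellRegion:
    def __init__(self):
        # column -> columns it has learned to predict next
        self.next_of = defaultdict(set)

    def learn(self, stream):
        # Learn which column tends to follow which during training.
        for a, b in zip(stream, stream[1:]):
            self.next_of[a].add(b)

    def output(self, active_column):
        # Output = active column plus every column put into predictive state.
        return {active_column} | self.next_of[active_column]

region = OneCellRegion()
# Training: a vertical line smoothly drifting between positions 3, 4 and 5.
region.learn([3, 4, 5, 4, 3, 4, 5])
print(region.output(4))   # {3, 4, 5}: same output for any nearby position
```

From the next region's point of view, positions 3, 4 and 5 look alike, which is the complex-cell-like invariance Mike describes.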

 

Does that make sense?

 

-Mike




_____________
Michael Ferrier
Department of Cognitive, Linguistic and Psychological Sciences, Brown
University
[email protected]

 

On Mon, Jul 15, 2013 at 4:23 PM, Scott Purdy <[email protected]> wrote:

Quinn, the older HTM implementations were completely different algorithms
and are now obsolete.

 

On Mon, Jul 15, 2013 at 1:09 PM, Quinn Liu <[email protected]> wrote:

Hi Michael,

    I had an additional question. In your reply you remarked that "while
digit recognition was successfully modeled with the original version of HTM,
that doesn't seem to be the case with CLA yet". I was wondering if you or
anyone else could expand on this, as I am unfamiliar with the original
version of the HTM. Assuming that it is an early version of the current
spatial and temporal learning algorithms, how is it different? Thanks!

 

Best Regards,

Quinn Liu

 

[email protected]

 

On Mon, Jul 15, 2013 at 3:41 PM, Michael Ferrier <[email protected]>
wrote:

Hi Fergal,

 

I completely agree that a visual object recognition system would greatly
benefit from hierarchy. Causes in the world are hierarchical, and the brain
uses hierarchy to learn and represent them. The successful vision models
using the original implementation of HTM were also hierarchical. I was just
saying that, as far as I know, this hasn't been done with CLA yet --
according to Jeff, in their vision experiments they were just beginning to
expand beyond one layer when they stopped working on vision.

 

I think that both temporal pooling (for invariance) and hierarchy are key to
using CLA for visual recognition problems, but I don't know of anyone who
has put all the pieces together yet to do visual recognition with CLA.

 

-Mike

 

 




_____________
Michael Ferrier
Department of Cognitive, Linguistic and Psychological Sciences, Brown
University
[email protected]

 

On Mon, Jul 15, 2013 at 11:44 AM, Fergal Byrne <[email protected]>
wrote:

 

Hi Michael,

 

Handwritten characters are undoubtedly multi-component designs, which have
evolved to connect with and trigger our ability to learn spatial, temporal
and hierarchical patterns. We perceive the same characters even when a great
deal varies between fonts, and especially when reading different people's
handwriting. We can fill in gaps and correct misspellings. So the learning
and prediction must be several levels deep in hierarchy.

 

In terms of bottom level mechanics, we use saccades to recognise and
"delocalise" components such as characters, facial features, etc, in such a
way as to allow this multi-level recognition (including a hierarchy of
fixations - for strokes, junctions, topology, characters, letters, words,
and even sentences). 

 

Speed-readers can saccade to read entire phrases and sentences at a time,
allowing reading speeds of thousands of words per minute with better than
70% comprehension scores. With practice, I've been able to get scores in the
1,000-2,000 wpm range. I can also read text in a mirror or upside-down at speeds
approaching 50-60% of an average reader. These things could only be done
using big, complex region hierarchies with vast volumes of (normal) reading
practice.

 

I would have predicted that a single layer CLA would struggle with this kind
of data set, because it lacks the multi-level upward and downward structure
which I feel this kind of performance requires.

 

Regards,

 

Fergal Byrne

 

_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org

 

