Hi Neal,

I'll answer 6) first. What you're doing is not foolish; it's very ambitious
and interesting, and will shed a lot of light on what kinds of problems
NuPIC can address, and also point at how best to set up and use NuPIC for
this kind of large problem. But, as formulated, it is not reasonable to
expect a single (standard-sized) NuPIC region to handle this.

The first part of this email discusses some serious difficulties arising
from your proposed approach, and this is followed by a suggested line of
research which might be more fruitful (and would certainly be less
hopeless!).

Let's look at the problem just in terms of the numbers (i.e. take a brute
force and ignorance approach).

Your training set is

6 categories x 100 videos x 4 sec x 24 fps = 57600 frames (9600 per
category)

You say each frame has an average of 32 "salient points" which each have an
x-y position and 64 integer dimensions of feature information. If you wish
to treat each of these dimensions as necessarily precise, the number of
bits per timestep (using a standard 128-bit ScalarEncoder for each value) is

32 points * (1 xpos + 1 ypos + 64 featureValues) * 128 bits per scalar =
270,336 bits per frame (44,352 on)
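For concreteness, here's the sizing arithmetic as a back-of-envelope sketch (the 128-bit encoder width and 21 active bits per scalar are assumptions about a typical ScalarEncoder configuration, so adjust to your actual encoder parameters):

```python
# Back-of-envelope sizing for the proposed input encoding.
CATEGORIES = 6
VIDEOS_PER_CATEGORY = 100
SECONDS = 4
FPS = 24
frames = CATEGORIES * VIDEOS_PER_CATEGORY * SECONDS * FPS  # 57600 frames

POINTS = 32        # average salient points per frame
DIMS = 1 + 1 + 64  # x, y, plus 64 feature values
N_BITS = 128       # ScalarEncoder output width (assumed)
W_BITS = 21        # active bits per encoded scalar (assumed)

scalars_per_frame = POINTS * DIMS               # 2112 scalars
bits_per_frame = scalars_per_frame * N_BITS     # 270336 bits
on_bits_per_frame = scalars_per_frame * W_BITS  # 44352 on bits

print(frames, bits_per_frame, on_bits_per_frame)
```

That's an input vector over a quarter of a million bits wide, every frame.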

Oh, hang on.

This neglects the semantic difference between the x, y values (spatial) and
the other (feature) dimensions. Maybe the data should be presented as some
kind of 2-D array, and use a topological setup for the SP. But then how do
you represent the values for each point? Do you use 64 fields per x-y
position? Do you use 64 (greyscale) maps? Um.

Bailing on that one, you also have the problem of ordering the points,
because presumably whatever is "in" the frame will be characterised by some
ordered subset of related points moving together in some spatial
relationship which evolves over time, and the ordering of the points will
not remain constant as the object(s) move. This ordering of the points, so
as to remain semantically constant, would have to be derived from some
knowledge of the underlying object (for example, the edge points of a cube
would consist of "corner 1", one or more "edge 1" points, "corner 2", and
so on) in an order which traverses the six outer and three inner edges
visible on a cube in a particular order. And that's just if the video is of
a simple geometric object.

The preceding paragraph begs the question. If you have knowledge of the
semantics of the objects in the frame, then you can convert your data into
"cube", "x pos", "y pos", "z pos", "size", "orientation1", ... for NuPIC,
which might then be able to find some temporal patterns; in which case why
would you feed it with 32 sets of 64-dimensional spatial data? On the other
hand, if you don't have this knowledge, how are you going to present the
points in a semantically consistent order?
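To make the ordering problem concrete, here's a toy sketch with entirely made-up data. Suppose you order keypoints by x position (as suggested in the original mail) and the "object" is just four labelled corners of a square; a modest rotation reshuffles the ordering, so the same corner lands in a different column of the input row:

```python
import math

def sort_by_x(points):
    """Order labelled keypoints by x position (one plausible fixed ordering)."""
    return sorted(points, key=lambda p: p[1][0])

def rotate(points, degrees):
    """Rotate each (x, y) keypoint about the origin."""
    t = math.radians(degrees)
    return [(label, (x * math.cos(t) - y * math.sin(t),
                     x * math.sin(t) + y * math.cos(t)))
            for label, (x, y) in points]

# A toy "object": the four labelled corners of a square.
square = [("A", (-1.0, -1.0)), ("B", (1.0, -1.0)),
          ("C", (1.0, 1.0)), ("D", (-1.0, 1.0))]

order_before = [label for label, _ in sort_by_x(square)]
order_after = [label for label, _ in sort_by_x(rotate(square, 60))]
print(order_before, order_after)  # the same corners land in different columns
```

So even a rigid rotation destroys the column-to-corner correspondence, and a pattern recogniser fed those rows would see the "same" point wandering between fields.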

What happens if the number of objects changes? What happens if the objects
swap positions (like binary stars)?

And the goal you're setting each NuPIC model? To go "ho-hum, nothing to see
here" when it's watching a particular sort of video, and "wow, this is new"
otherwise...

I hope you get the drift.

As I said in the earlier thread, your problem starts when you put something
like OpenCV in between your raw image data and a processor based on
spatial-temporal pattern recognition. OpenCV is fine if you have a
procedural algorithm that can analyse your list of points. But it's
meaningless to give a pattern recogniser a list (whose ordering is
significant and changing non-linearly) of structured data, with position
merely an entry, and where each point has 64 dimensions to identify it, and
where it's the spatial correlations between some (specific for each
point-point relationship) ordered subset of the feature values which is
pertinent to object identity.

This is why one third of the human brain is dedicated to vision. That's
about 250K NuPICs, arranged in a hierarchy which has evolved over tens of
millions of years (and with subcortical components hundreds of millions of
years old).

Of course, you can build tiny visual systems (like an insect's) if you're
really clever how you program the small number of neurons. But NuPIC is a
general learning system, so you need vast real estate to do anything
meaningful.

And finally, we don't see the whole picture evenly at all. We build models
of what things are in the part of the world we're seeing, and we focus our
attention on gathering specific data and updating our memories of certain
features of each object. We "paint" a memory of the state of the world, and
we watch out for anomalous changes in that memory.

This motivates a suggested line of approach.

Let's turn the thing around and treat each of the 6 models as a specialised
"watcher" who learns by watching a series of 100 videos. When presented
with similar content, each watcher happily predicts what it's likely to see
next. To do this, it must have learned about some of the "things" in the
video stream. These "things" are sets of patches and edges which evolve in
some familiar way over sequences of 96 steps. There are 100 such sequences,
each with its own patterns of patches and edges to be learned.

Working backwards again, each "watcher" must learn to extract a
representation from each frame, and these representations must change
predictably. Anything which doesn't maintain spatial and temporal
consistency will not be learned by the watcher; it will simply be ignored.
Anything which does have consistency, but is not from the learning
material, will generate an anomaly (otherwise the "watcher" is not
specialising in its area of interest).

So, our goal is to build something which can track and predict the
evolution of a certain kind of visual information in the video stream. To
do this, it'll have to have some "micro-saccading" ability, which will
allow it to move over the video frame and keep the "centre of interest" in
a constant position. For instance, if watching a video of 800m runners over
4 seconds, the "centre of interest" will be the centre of the jostling
group of athletes. The image information relative to this position (with
density dropping away with distance from the centre) will change in some
learnable way.
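A minimal sketch of what I mean by keeping the "centre of interest" in a constant position (the centroid here is a crude stand-in for whatever the cameraman actually learns, and the data is hypothetical):

```python
def centre_of_interest(points):
    """Crude stand-in for a learned cameraman: the keypoint centroid."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def normalise(points):
    """Re-express keypoints relative to the centre of interest, so the
    viewer sees a view that stays centred as the action moves."""
    cx, cy = centre_of_interest(points)
    return [(x - cx, y - cy) for x, y in points]

frame = [(118.0, 53.0), (70.0, 58.0), (100.0, 45.0)]
print(normalise(frame))  # [(22.0, 1.0), (-26.0, 6.0), (4.0, -7.0)]
```

A real cameraman would also handle the density falloff with distance from the centre, but the point is that the viewer only ever sees coordinates relative to the tracked position.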

The idea I have is, for each "watcher", to have two components: a
"cameraman" and a "viewer". The cameraman learns how to transform each
frame and feed a normalised "image" to the viewer, which learns the
evolution of its normalised view of the stream. In turn, the viewer will
feed back anomaly information for the cameraman so it knows how well it's
tracking the relevant information.

When a frame is presented, the cameraman may by chance be pointing at the
right spot. If so, the viewer will recognise a familiar pattern and
tell the cameraman not to move. If not, the viewer will generate an anomaly
pattern which represents what is "wrong" with the current view. The
cameraman will use this to move the view so as to satisfy the viewer. If
this process results in no "lock-on" after a certain number of tries, the
system will declare a general anomaly.
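The control flow of that loop, as a sketch (every name here is hypothetical, and the stubs exist only to exercise the logic; real components would be CLA-based):

```python
def watch_frame(frame, cameraman, viewer, max_tries=5, threshold=0.3):
    """One test-time step of the proposed loop (all names hypothetical).

    The cameraman proposes a view; the viewer scores it with an anomaly
    value in [0, 1]. A low score means familiar content, so hold position;
    a high score feeds back so the cameraman can move the view. If nothing
    locks on within max_tries, declare a general anomaly.
    """
    for _ in range(max_tries):
        view = cameraman.transform(frame)
        anomaly = viewer.score(view)
        if anomaly < threshold:
            return anomaly          # lock-on: familiar content
        cameraman.adjust(anomaly)   # nudge the centre of interest
    return 1.0                      # general anomaly

# Minimal stubs, just to exercise the control flow.
class StubCameraman:
    def transform(self, frame):
        return frame
    def adjust(self, anomaly):
        pass

class StubViewer:
    def __init__(self, score):
        self._score = score
    def score(self, view):
        return self._score

print(watch_frame("frame", StubCameraman(), StubViewer(0.1)))  # 0.1
print(watch_frame("frame", StubCameraman(), StubViewer(0.9)))  # 1.0
```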

Note that it's likely the viewer must be a hierarchy, not just a single CLA
layer. The cameraman might also be a hierarchy. Also, the above scenario is
for a system which has learned - it's the test condition. Training such a
system might prove difficult. You might need a lot more than 100 4-second
videos!

Thanks again for presenting us with this challenge.

Regards

Fergal Byrne



On Thu, Nov 21, 2013 at 7:08 AM, Neal Donnelly <[email protected]> wrote:

> Hey everyone,
>
> I wrote to the list a couple weeks ago for some advice on my project
> applying NuPIC to human-action video classification. I got some good
> feedback at a high level, but I'm ready to ask some much more specific
> questions about field parameters.
>
> I'm trying to write a search_def.json to run a swarm as trying to figure
> out all the model parameters manually seems hopeless. I have six different
> types of videos, each of which has 100 video examples. My goal is to build
> a TemporalAnomaly model for each of the six types of video so that I can
> compare the anomaly score of an unknown video between models. I wrote a
> script that can take any number of videos and turn them into a csv file by
> identifying salient points in each frame and writing the descriptor of each
> to a line. Right now my csv file looks like this
>
>
> video_index,frame_index,x,y,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f16,f17,f18,f19,f20,f21,f22,f23,f24,f25,f26,f27,f28,f29,f30,f31,f32,f33,f34,f35,f36,f37,f38,f39,f40,f41,f42,f43,f44,f45,f46,f47,f48,f49,f50,f51,f52,f53,f54,f55,f56,f57,f58,f59,f60,f61,f62,f63
>
> int,int,float,float,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int
> ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
> 0, 0, 118.0, 53.0, 6, 132, 138, 218, 203, 225, 176, 202, 190, 19, 137,
> 146, 132, 154, 105, 88, 235, 147, 123, 147, 2, 7, 14, 115, 149, 6, 26, 95,
> 148, 81, 45, 138
> 0, 0, 70.0, 58.0, 92, 18, 174, 164, 2, 42, 59, 80, 160, 167, 143, 189, 95,
> 167, 107, 90, 55, 213, 237, 166, 8, 213, 129, 64, 209, 231, 107, 133, 160,
> 243, 39, 115
>
> f0 through f63 represent the 64 dimensional descriptor that describes the
> feature invariant to scale, translation, rotation, and shear changes. On
> average, each frame has 32 keypoints, but the number detected varies from
> frame to frame. This leaves me with a set of questions, of which I'd be
> happy to get answers for any.
>
> 1) Is it possible to have many datapoints for the same time point? From my
> understanding from the HTM paper, each new data point is assumed to be a
> new observation as time passes. However, if I enter all the keyframes in
> parallel as one csv row, I'm concerned that the system will think the order
> that they are listed in the line matters. If I had to, I would probably
> rank them by something like x position.
>
> 2) Can I present time moving forward with an index rather than a datetime?
> Creating a fictitious datetime seems hacky.
>
> 3) How do I differentiate between the videos? There'll be an abrupt
> change, sure to trigger an anomaly, when the videos switch and there are
> two discontinuous frames. Is there anything I can do to explain this to the
> CLA?
>
> 4) I want the swarm to find a model that reaches a low anomaly score on
> the set of videos to which it's been exposed. I would expect to then define
>     "customErrorMetric": {
>       "customExpr": "anomalyScore",
>       "errorWindow": 3072 // 32 features/frame * 24 frames/sec * 4 sec
>     }
> Is anomalyScore the right field name? Is this a good move?
>
> 5) I assume that predictionSteps and predictionField only serve to tell
> the swarm where to look for the predicted value to compare to the actual.
> If I define a custom error metric, are these still relevant?
>
> 6) Is my approach reasonable? Am I doing anything obviously foolish?
>
> 7) Is there any more documentation on parameter dictionaries? I've been
> sort of searching the repo and the wiki but it's all rather ad hoc text
> searches. Specifically, where can I find a list of the flags that go on the
> third line of the CSV?
>
> Thanks so much!
> Neal Donnelly
>
> _______________________________________________
> nupic mailing list
> [email protected]
> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>
>


-- 

Fergal Byrne, Brenter IT

http://inbits.com - Better Living through
Thoughtful Technology

e:[email protected] t:+353 83 4214179
Formerly of Adnet [email protected] http://www.adnet.ie
