Hi Neal,

I'll answer 6) first. What you're doing is not foolish; it's very ambitious and interesting, and will shed a lot of light on what kinds of problems NuPIC can address, and also point at how best to set up and use NuPIC for this kind of large problem. But, as formulated, it is not reasonable to expect a single (standard-sized) NuPIC region to handle this.
The first part of this email discusses some serious difficulties arising from your proposed approach; this is followed by a suggested line of research which might be more fruitful (and would certainly be less hopeless!).

Let's look at the problem just in terms of the numbers (i.e., take a brute-force-and-ignorance approach). Your training set is 6 categories x 100 videos x 4 sec x 24 fps = 57,600 frames (9,600 per category). You say each frame has an average of 32 "salient points", each of which has an x-y position and 64 integer dimensions of feature information. If you wish to treat each of these dimensions as necessarily precise, the number of bits per timestep (using a standard 128-bit ScalarEncoder for each value) is 32 points * (1 xpos + 1 ypos + 64 feature values) * 128 bits per scalar = 270,336 bits per frame (44,352 of them on).

Oh, hang on. This neglects the semantic difference between the x, y values (spatial) and the other (feature) dimensions. Maybe the data should be presented as some kind of 2-D array, using a topological setup for the SP. But then how do you represent the values for each point? Do you use 64 fields per x-y position? Do you use 64 (greyscale) maps? Um.

Bailing on that one, you also have the problem of ordering the points: presumably whatever is "in" the frame will be characterised by some ordered subset of related points moving together in some spatial relationship which evolves over time, and the ordering of the points will not remain constant as the object(s) move. Ordering the points so that they remain semantically constant would require some knowledge of the underlying object. For example, the edge points of a cube would consist of "corner 1", one or more "edge 1" points, "corner 2", and so on, in an order which traverses the six outer and three inner edges visible on the cube. And that's just if the video is of a simple geometric object.

The preceding paragraph begs the question.
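To make that arithmetic concrete, here's a back-of-envelope sketch in plain Python. The 128-bit encoder width comes from your setup; the 21 active bits per scalar is my assumption, inferred from the 44,352 "on" figure (44,352 / 2,112 scalars = 21):

```python
# Back-of-envelope sizing for encoding one frame of keypoint data.
# Assumes a 128-bit ScalarEncoder per value with 21 active bits
# (21 is inferred from the 44,352 "on" figure, not stated anywhere).

CATEGORIES = 6
VIDEOS_PER_CATEGORY = 100
SECONDS = 4
FPS = 24

POINTS_PER_FRAME = 32           # average salient points per frame
FIELDS_PER_POINT = 1 + 1 + 64   # x, y, and 64 feature dimensions
BITS_PER_SCALAR = 128           # ScalarEncoder output width (n)
ACTIVE_BITS_PER_SCALAR = 21     # ScalarEncoder active bits (w)

total_frames = CATEGORIES * VIDEOS_PER_CATEGORY * SECONDS * FPS
scalars_per_frame = POINTS_PER_FRAME * FIELDS_PER_POINT
bits_per_frame = scalars_per_frame * BITS_PER_SCALAR
on_bits_per_frame = scalars_per_frame * ACTIVE_BITS_PER_SCALAR

print(total_frames)        # 57600 frames (9600 per category)
print(bits_per_frame)      # 270336 bits per frame
print(on_bits_per_frame)   # 44352 on bits per frame
```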
If you have knowledge of the semantics of the objects in the frame, then you can convert your data into "cube", "x pos", "y pos", "z pos", "size", "orientation1", ... for NuPIC, which might then be able to find some temporal patterns; in which case, why would you feed it 32 sets of 64-dimensional spatial data? On the other hand, if you don't have this knowledge, how are you going to present the points in a semantically consistent order? What happens if the number of objects changes? What happens if the objects swap positions (like binary stars)?

And the goal you're setting each NuPIC model? To go "ho-hum, nothing to see here" when it's watching a particular sort of video, and "wow, this is new" otherwise... I hope you get the drift.

As I said in the earlier thread, your problem starts when you put something like OpenCV between your raw image data and a processor based on spatial-temporal pattern recognition. OpenCV is fine if you have a procedural algorithm that can analyse your list of points. But it's meaningless to give a pattern recogniser a list (whose ordering is significant and changes non-linearly) of structured data, with position merely one entry, where each point has 64 dimensions to identify it, and where it's the spatial correlations between some ordered subset of the feature values (specific to each point-point relationship) which are pertinent to object identity.

This is why one third of the human brain is dedicated to vision. That's about 250K NuPICs, arranged in a hierarchy which has evolved over tens of millions of years (and with subcortical components hundreds of millions of years old). Of course, you can build tiny visual systems (like an insect's) if you're really clever about how you program the small number of neurons. But NuPIC is a general learning system, so you need vast real estate to do anything meaningful.

And finally, we don't see the whole picture evenly at all.
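To illustrate the contrast, here's a toy sketch of the two kinds of input. The field names and ranges are hypothetical, purely to show the difference in shape between a semantically-labelled frame and a bag of anonymous keypoints:

```python
# Illustration only: two ways of presenting one frame to NuPIC.
# All field names and value ranges below are made up for this sketch.
import random

# With object semantics, a frame collapses to a few meaningful fields
# whose identities stay constant from frame to frame:
semantic_row = {
    "object": "cube", "x": 118.0, "y": 53.0, "z": 12.0,
    "size": 40.0, "orientation1": 0.3,
}

# Without semantics, the same frame is 32 anonymous points x 66 values,
# whose ordering carries no stable meaning as the objects move:
raw_row = [
    [random.uniform(0, 320), random.uniform(0, 240)]    # x, y
    + [random.randint(0, 255) for _ in range(64)]       # f0..f63
    for _ in range(32)
]

print(len(semantic_row))                 # 6 stable fields
print(len(raw_row) * len(raw_row[0]))    # 2112 anonymous values
```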
We build models of what things are in the part of the world we're seeing, and we focus our attention on gathering specific data and updating our memories of certain features of each object. We "paint" a memory of the state of the world, and we watch out for anomalous changes in that memory.

This motivates a suggested line of approach. Let's turn the thing around and treat each of the 6 models as a specialised "watcher" who learns by watching a series of 100 videos. When presented with similar content, each watcher happily predicts what it's likely to see next. To do this, it must have learned about some of the "things" in the video stream. These "things" are sets of patches and edges which evolve in some familiar way over sequences of 96 steps (4 sec x 24 fps). There are 100 such sequences, each with its own patterns of patches and edges to be learned.

Working backwards again, each "watcher" must learn to extract a representation from each frame, and these representations must change predictably. Anything which doesn't maintain spatial and temporal consistency will not be learned by the watcher; it will be ignored. Anything which does have consistency, but is not from the learning material, will generate an anomaly (otherwise the "watcher" is not specialising in its area of interest).

So, our goal is to build something which can track and predict the evolution of a certain kind of visual information in the video stream. To do this, it'll need some "micro-saccading" ability, which will allow it to move over the video frame and keep the "centre of interest" in a constant position. For instance, if watching a video of 800m runners over 4 seconds, the "centre of interest" will be the centre of the jostling group of athletes. The image information relative to this position (with density dropping away with distance from the centre) will change in some learnable way.

The idea I have is, for each "watcher", to have two components: a "cameraman" and a "viewer".
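The interplay between these two components can be caricatured as a tiny control loop. Everything here is a hypothetical sketch: the function names, the thresholds, and the toy anomaly function are all my assumptions (a real viewer would be a CLA, likely a hierarchy, not a distance function):

```python
# Hypothetical sketch of the cameraman/viewer feedback loop.
# All names and thresholds are assumptions for illustration only.

MAX_TRIES = 5            # saccade attempts before declaring a general anomaly
LOCK_ON_THRESHOLD = 0.3  # anomaly score below which the view is "locked on"

def watch_frame(frame_centre, camera_centre, anomaly_for_view, saccade):
    """Return the watcher's anomaly score for one frame.

    frame_centre     -- where the interesting content actually is
    camera_centre    -- where the cameraman is currently pointing
    anomaly_for_view -- viewer stand-in: (frame_centre, view_centre) -> 0..1
    saccade          -- cameraman stand-in: moves the view using viewer feedback
    """
    centre = camera_centre
    for _ in range(MAX_TRIES):
        score = anomaly_for_view(frame_centre, centre)
        if score < LOCK_ON_THRESHOLD:
            return score              # familiar view: tell the cameraman not to move
        centre = saccade(centre, frame_centre)  # move so as to satisfy the viewer
    return 1.0                        # no lock-on: declare a general anomaly

# Toy stand-ins: anomaly grows with distance from the true centre,
# and the cameraman halves the remaining distance on each saccade.
anomaly = lambda target, view: min(1.0, abs(target - view) / 100.0)
saccade = lambda view, target: view + (target - view) / 2.0

print(watch_frame(50.0, 130.0, anomaly, saccade))  # locks on after a few saccades
```

The point of the loop is the direction of information flow: the viewer never sees raw frames, only the cameraman's normalised view, and the cameraman never interprets content, only the viewer's anomaly feedback.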
The cameraman learns how to transform each frame and feed a normalised "image" to the viewer, which learns the evolution of its normalised view of the stream. In turn, the viewer feeds back anomaly information to the cameraman so it knows how well it's tracking the relevant information.

When a frame is presented, by chance the cameraman may be pointing at the right spot. If so, the viewer will recognise a familiar pattern and tell the cameraman not to move. If not, the viewer will generate an anomaly pattern which represents what is "wrong" with the current view. The cameraman will use this to move the view so as to satisfy the viewer. If this process results in no "lock-on" after a certain number of tries, the system will declare a general anomaly.

Note that it's likely the viewer must be a hierarchy, not just a single CLA layer. The cameraman might also be a hierarchy. Also, the above scenario is for a system which has learned - it's the test condition. Training such a system might prove difficult. You might need a lot more than 100 4-second videos!

Thanks again for presenting us with this challenge.

Regards

Fergal Byrne

On Thu, Nov 21, 2013 at 7:08 AM, Neal Donnelly <[email protected]> wrote:

> Hey everyone,
>
> I wrote to the list a couple weeks ago for some advice on my project
> applying NuPIC to human-action video classification. I got some good
> feedback at a high level, but I'm ready to ask some much more specific
> questions about field parameters.
>
> I'm trying to write a search_def.json to run a swarm, as trying to figure
> out all the model parameters manually seems hopeless. I have six different
> types of videos, each of which has 100 video examples. My goal is to build
> a TemporalAnomaly model for each of the six types of video so that I can
> compare the anomaly score of an unknown video between models.
> I wrote a
> script that can take any number of videos and turn them into a csv file by
> identifying salient points in each frame and writing the descriptor of each
> to a line. Right now my csv file looks like this:
>
> video_index,frame_index,x,y,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f16,f17,f18,f19,f20,f21,f22,f23,f24,f25,f26,f27,f28,f29,f30,f31,f32,f33,f34,f35,f36,f37,f38,f39,f40,f41,f42,f43,f44,f45,f46,f47,f48,f49,f50,f51,f52,f53,f54,f55,f56,f57,f58,f59,f60,f61,f62,f63
> int,int,float,float,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int
> ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
> 0, 0, 118.0, 53.0, 6, 132, 138, 218, 203, 225, 176, 202, 190, 19, 137,
> 146, 132, 154, 105, 88, 235, 147, 123, 147, 2, 7, 14, 115, 149, 6, 26, 95,
> 148, 81, 45, 138
> 0, 0, 70.0, 58.0, 92, 18, 174, 164, 2, 42, 59, 80, 160, 167, 143, 189, 95,
> 167, 107, 90, 55, 213, 237, 166, 8, 213, 129, 64, 209, 231, 107, 133, 160,
> 243, 39, 115
>
> f0 through f63 represent the 64-dimensional descriptor that describes the
> feature, invariant to scale, translation, rotation, and shear changes. On
> average, each frame has 32 keypoints, but the number detected varies from
> frame to frame. This leaves me with a set of questions; I'd be
> happy to get answers for any of them.
>
> 1) Is it possible to have many datapoints for the same time point? From my
> understanding of the HTM paper, each new data point is assumed to be a
> new observation as time passes. However, if I enter all the keypoints in
> parallel as one csv row, I'm concerned that the system will think the order
> that they are listed in matters. If I had to, I would probably
> rank them by something like x position.
>
> 2) Can I present time moving forward with an index rather than a datetime?
> Creating a fictitious datetime seems hacky.
>
> 3) How do I differentiate between the videos? There'll be an abrupt
> change, sure to trigger an anomaly, when the videos switch and there are
> two discontinuous frames. Is there anything I can do to explain this to the
> CLA?
>
> 4) I want the swarm to find a model that reaches a low anomaly score on
> the set of videos to which it's been exposed. I would expect to then define
>
> "customErrorMetric": {
>     "customExpr": "anomalyScore",
>     "errorWindow": 3072  // 32 features/frame * 24 frames/sec * 4 sec
> }
>
> Is anomalyScore the right field name? Is this a good move?
>
> 5) I assume that predictionSteps and predictionField only serve to tell
> the swarm where to look for the predicted value to compare to the actual.
> If I define a custom error metric, are these still relevant?
>
> 6) Is my approach reasonable? Am I doing anything obviously foolish?
>
> 7) Is there any more documentation on parameter dictionaries? I've been
> searching the repo and the wiki, but it's all rather ad hoc text
> searching. Specifically, where can I find a list of the flags that go on the
> third line of the CSV?
>
> Thanks so much!
> Neal Donnelly
>
> _______________________________________________
> nupic mailing list
> [email protected]
> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org

--
Fergal Byrne, Brenter IT <http://www.examsupport.ie>
http://inbits.com - Better Living through Thoughtful Technology
e: [email protected]  t: +353 83 4214179
Formerly of Adnet [email protected] http://www.adnet.ie
