Moshe, Here is a response to your question on how to represent visual information in an AGI system ... followed by some more general comments on incorporating vision in early-stage AGI systems...
Computer vision is not something I anticipate working on in the next couple of years, but it's still interesting to think about. Parts of this e-mail may be a bit hard for non-Novamente people to follow, but if you read the stuff on www.realai.net, it should be possible -- and please feel free to ask questions. (After our book on Novamente comes out in 2003, all will be clear as crystal -- heh ;). And the end of the mail gets less technical.

Ok... Suppose we have a system with several camera eyes as inputs. Assume the camera eyes are all located in the same physical vicinity, so that they can all easily be viewing different aspects of the same localized visual scene. I am assuming this particular setup for concreteness, not because the ideas I'm going to present apply only to this particular setup.

I'll describe (sketchily) a 3-layer data representation architecture.

****

Layer 1
________

In this layer, the output of each camera eye is represented as a set of 2D arrays. One of these 2D arrays will represent the raw output of the camera; the others will represent various processed versions of this raw output (e.g. Fourier or wavelet transforms, filtered versions, etc.). This array, in the terminology of my e-mail last night, is "modality-correspondent."

How are these arrays represented? In a neural net based AI system they are neurons arranged with a 2D-sheet topology. In Novamente, there are very many choices... to name just a few,

A) PixelInstanceNodes, whose TruthValues represent stimulus... and whose location [if needed] is represented via relations like

   EvaluationLink location P (5,7)

or

B) ListLinks of SimpleTruthValues (nested ListLinks representing 2D arrays). These could be NumericalListLinks, meaning they could be stored and manipulated efficiently in the NumericArrayServer [which doesn't yet exist, but soon will].

The choice between these options is basically an implementation decision, though it will have some impact on the ease of carrying out various operations on the 2D arrays, and hence will have a cognitive impact as well. My guess is that our first try would be with B, because of the greater efficiency of carrying out low-level transformations...
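Just to make option B a little more concrete, here is a rough Python/numpy sketch of the kind of structure I have in mind for this layer: a set of 2D arrays per camera eye, one raw and a couple of derived ones. The class name (Layer1Percept) and the particular transforms are invented for this e-mail only -- this is not Novamente code, which would store the arrays as NumericalListLinks rather than numpy arrays.

# Rough illustrative sketch only -- not Novamente code.  Names like
# Layer1Percept are made up for this e-mail.  This mimics option B:
# each camera eye contributes a raw 2D array plus derived 2D arrays
# (here, a Fourier magnitude image and a simple smoothed image).

import numpy as np


class Layer1Percept:
    """Modality-correspondent data for one camera eye: a dict of 2D arrays."""

    def __init__(self, raw_frame: np.ndarray):
        self.arrays = {"raw": raw_frame.astype(float)}
        self.arrays["fourier_mag"] = np.abs(np.fft.fft2(self.arrays["raw"]))
        self.arrays["smoothed"] = self._box_filter(self.arrays["raw"], size=3)

    @staticmethod
    def _box_filter(img: np.ndarray, size: int) -> np.ndarray:
        # Crude box filter; stands in for whatever low-level filtering one wants.
        padded = np.pad(img, size // 2, mode="edge")
        out = np.zeros_like(img)
        for dy in range(size):
            for dx in range(size):
                out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
        return out / (size * size)


if __name__ == "__main__":
    # Two "camera eyes" viewing (slightly different aspects of) the same scene.
    rng = np.random.default_rng(0)
    eyes = [Layer1Percept(rng.random((64, 64))) for _ in range(2)]
    print(eyes[0].arrays["fourier_mag"].shape)   # (64, 64)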
Layer 2
________

In this layer, we want to represent 3D scenes, and time series of 3D scenes. This layer, in the terminology of my e-mail last night, is "world-correspondent," meaning that it corresponds to a projection of the physical world, but NOT directly to the stimuli coming in on any sense organ.

So we want -- among other things -- a 3D array representation, each element of the array representing a pixel. For "animations" we have arrays of 3D arrays...

In a neural net architecture, there are several well-known strategies for representation in terms of neurons.... In Novamente, we again have several choices for implementation, including a PixelInstanceNode based repr., and a NumericalListLink based repr. [Note that here the weight-of-evidence component of the TruthValue of a PixelInstanceNode is very important, because only some of the 3D scene will be inferrable with any confidence. Some parts of it -- e.g. the backs of objects, or obscured regions -- will only be inferrable very conjecturally.]

One could also do a lot of other things. One could do curve-fitting with spline curves, if one wanted, and represent these spline curves (in Novamente, as SchemaInstanceNodes relating PixelInstanceNodes and NumberNodes). This would represent a heuristic assumption that low-level representation of smooth curves (in space, or over time as well) is a useful thing. Alternatively, one could do fractal curve-fitting a la Michael Barnsley, which represents a different heuristic assumption....

One way or another, these would represent a set of relationships *about* 3D/4D arrays... meaning that for interpreting them one still wants to have schemata representing 3D/4D array operations (rotations, translations, etc.) easily available....

Layer 3
________

In this layer, we represent cognitive structures. It's just networks of relationships, represented however one represents relationships in one's AGI system. Heuristic knowledge of the physical world & its visual representation will obviously be useful for processing such relationships. Such heuristic knowledge, however, should be inferrable from patterns in the 3D/4D scenes in Layer 2.

Finally, a comment on parameter-tuning and procedure learning. An architecture like this is going to have LOTS of free parameters, and it also requires a lot of specialized processing schemata at all levels. It definitely makes sense to tune the parameters via optimization, where the objective function is whole-system achievement of environment-dependent goals using the parameter values. I suspect it also makes sense to have many of the specialized processing schemata learned via optimization, with the same systemic-goal-achievement fitness function. In this case, one gives the system a multilayer architecture & data repr. to work with, but one "teaches" it (via evolutionary or probabilistic optimization methods) the processing algorithms that need to reside at each layer.
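To illustrate what I mean by tuning against a systemic-goal-achievement fitness function, here is a toy Python sketch of the outer optimization loop. The names (goal_achievement_score, NUM_PARAMS, etc.) are placeholders invented for this e-mail; in reality the "fitness evaluation" would be the whole multilayer system acting in its environment, which is of course the expensive part, and one might use probabilistic rather than evolutionary optimization.

# Toy sketch of parameter tuning via evolutionary optimization, where the
# fitness of a parameter vector is (in the real system) the degree to which
# the whole architecture achieves its goals in the environment.  Here the
# system is faked with a stand-in scoring function so the loop actually runs.

import random

NUM_PARAMS = 8          # e.g. thresholds, decay rates, filter widths at each layer
POP_SIZE = 20
GENERATIONS = 50
MUTATION_SIGMA = 0.1


def goal_achievement_score(params):
    """Stand-in for running the whole system on environment-dependent goals.

    In the real setting this would instantiate the 3-layer architecture with
    these parameter values, let it act in its environment, and measure goal
    achievement.  Here we just pretend the optimum is an arbitrary target.
    """
    target = [0.3] * NUM_PARAMS
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def mutate(params):
    # Gaussian perturbation of each parameter value.
    return [p + random.gauss(0.0, MUTATION_SIGMA) for p in params]


def evolve():
    population = [[random.random() for _ in range(NUM_PARAMS)]
                  for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        scored = sorted(population, key=goal_achievement_score, reverse=True)
        parents = scored[:POP_SIZE // 2]              # keep the better half
        children = [mutate(random.choice(parents)) for _ in range(POP_SIZE // 2)]
        population = parents + children
    return max(population, key=goal_achievement_score)


if __name__ == "__main__":
    best = evolve()
    print("best params:", [round(p, 3) for p in best])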
*****

Now, one could potentially create a super-smart AGI without any vision at all, then give it eyes, and watch it *discover* an architecture like the above on its own. This would be sort of like a human scientist creating new sensory organs for himself and then using science to create a new sensory cortex to go with them.

But given how important vision is to humans, it may well be interesting to actually build a vision lobe for an AGI system prior to the super-smart stage (i.e. prior to the human-intelligence stage). In my approach to AGI, the value of vision-lobe construction depends on the answer to the question: "Once we build an AGI with a reasonably high degree of intelligence and no visual sensors, how long does it take for this AGI to get super-smart via self-modification?" But my approach isn't vision-centric.

There are at least two possible reasons for taking a more aggressive approach to implementing vision:

1) If you believe Keith Hoyes, then visual metaphors are SUCH a powerful heuristic that one would be dumb not to incorporate vision in one's first version of an AGI system.

2) There's an alternate argument that humanlike sensation/action is so important to human language, and understanding human language is so useful to a growing AGI, that it makes sense to give a humanlike AI humanlike sensation/action as soon as possible, just to maximally facilitate its communication with humans.

I find the second argument more convincing than the first.... But still, I prefer to proceed first by creating a system without vision, and to add vision only when time/necessity seem to dictate. I would rather add *touch* and *hearing* than vision first, if I'm going to add physical sensors to Novamente.

Now, simulating touch in an AGI system is a VERY interesting issue to me, because touch ties directly into the sense of physical self that we humans have so naturally (and that develops in us during early infancy...).

And, Moshe, you hint at adding simulated physical sensors rather than actual ones. I am skeptical of this in the short run, because there is no simulated physical environment containing anywhere near the richness of the real-world perceptual environment. [The Net contains loads of data, but not so much simulated sensory data...]

-- Ben G
