Moshe, Here is a response to your question on how to represent visual information in an AGI system ... followed by some more general comments on incorporating vision in early-stage AGI systems...
Computer vision is not something I anticipate working on in the next couple of years, but it's still interesting to think about. Parts of this e-mail may be a bit hard for non-Novamente people to follow, but if you read the stuff on www.realai.net, it should be possible -- and please feel free to ask questions. (After our book on Novamente comes out in 2003, all will be clear as crystal -- heh ;). And the end of the mail gets less technical.

Ok... Suppose we have a system with several camera eyes as inputs. Assume the camera eyes are all located in the same physical vicinity, so that they can all easily be viewing different aspects of the same localized visual scene. I am assuming this particular setup for concreteness, not because the ideas I'm going to present apply only to this particular setup.

I'll describe (sketchily) a 3-layer data representation architecture.

****

Layer 1
________

In this layer, the output of each camera eye is represented as a set of 2D arrays. One of these 2D arrays will represent the raw output of the camera; the others will represent various processed versions of this raw output (e.g. Fourier or wavelet transforms, filtered versions, etc.). This array, in the terminology of my e-mail last night, is "modality-correspondent."

How are these arrays represented? In a neural net based AI system they are neurons arranged with a 2D-sheet topology. In Novamente, there are very many choices... to name just a few,

A) PixelInstanceNodes, whose TruthValues represent stimulus... and whose location [if needed] is represented via relations like

   EvaluationLink location P (5,7)

or

B) ListLinks of SimpleTruthValues (nested ListLinks representing 2D arrays). These could be NumericalListLinks, meaning they could be stored and manipulated efficiently in the NumericArrayServer [which doesn't yet exist, but soon will].

The choice between these options is basically an implementation decision, though it will have some impact on the ease of carrying out various operations on the 2D arrays, and hence will have a cognitive impact as well. My guess is that our first try would be with B, because of the greater efficiency of carrying out low-level transformations...
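Just to make option B a little more concrete, here is a rough Python/numpy sketch of the kind of structure I have in mind for this layer: a set of 2D arrays per camera eye, one raw and a couple of derived ones. The class name (Layer1Percept) and the particular transforms are invented for this e-mail only -- this is not Novamente code, which would store the arrays as NumericalListLinks rather than numpy arrays.

# Rough illustrative sketch only -- not Novamente code.  Names like
# Layer1Percept are made up for this e-mail.  This mimics option B:
# each camera eye contributes a raw 2D array plus derived 2D arrays
# (here, a Fourier magnitude image and a simple smoothed image).

import numpy as np


class Layer1Percept:
    """Modality-correspondent data for one camera eye: a dict of 2D arrays."""

    def __init__(self, raw_frame: np.ndarray):
        self.arrays = {"raw": raw_frame.astype(float)}
        self.arrays["fourier_mag"] = np.abs(np.fft.fft2(self.arrays["raw"]))
        self.arrays["smoothed"] = self._box_filter(self.arrays["raw"], size=3)

    @staticmethod
    def _box_filter(img: np.ndarray, size: int) -> np.ndarray:
        # Crude box filter; stands in for whatever low-level filtering one wants.
        padded = np.pad(img, size // 2, mode="edge")
        out = np.zeros_like(img)
        for dy in range(size):
            for dx in range(size):
                out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
        return out / (size * size)


if __name__ == "__main__":
    # Two "camera eyes" viewing (slightly different aspects of) the same scene.
    rng = np.random.default_rng(0)
    eyes = [Layer1Percept(rng.random((64, 64))) for _ in range(2)]
    print(eyes[0].arrays["fourier_mag"].shape)   # (64, 64)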
Layer 2
________

In this layer, we want to represent 3D scenes, and time series of 3D scenes. This layer, in the terminology of my e-mail last night, is "world-correspondent," meaning that it corresponds to a projection of the physical world, but NOT directly to the stimuli coming in on any sense organ.

So we want -- among other things -- a 3D array representation, each element of the array representing a pixel. For "animations" we have arrays of 3D arrays...

In a neural net architecture, there are several well-known strategies for representation in terms of neurons.... In Novamente, we again have several choices for implementation, including a PixelInstanceNode based repr., and a NumericalListLink based repr. [Note that here the weight-of-evidence component of the TruthValue of a PixelInstanceNode is very important, because only some of the 3D scene will be inferrable with any confidence. Some parts of it -- e.g. the backs of objects, or obscured regions -- will only be inferrable very conjecturally.]

One could also do a lot of other things. One could do curve-fitting with spline curves, if one wanted, and represent these spline curves (in Novamente, as SchemaInstanceNodes relating PixelInstanceNodes and NumberNodes). This would represent a heuristic assumption that low-level representation of smooth curves (in space, or over time as well) is a useful thing. Alternatively, one could do fractal curve-fitting a la Michael Barnsley, which represents a different heuristic assumption....

One way or another, these would represent a set of relationships *about* 3D/4D arrays... meaning that for interpreting them one still wants to have schemata representing 3D/4D array operations (rotations, translations, etc.) easily available....

Layer 3
________

In this layer, we represent cognitive structures. It's just networks of relationships, represented however one represents relationships in one's AGI system. Heuristic knowledge of the physical world & its visual representation will obviously be useful for processing such relationships. Such heuristic knowledge, however, should be inferrable from patterns in the 3D/4D scenes in Layer 2.

Finally, a comment on parameter-tuning and procedure learning. An architecture like this is going to have LOTS of free parameters, and it also requires a lot of specialized processing schemata at all levels. It definitely makes sense to tune the parameters via optimization, where the objective function is whole-system achievement of environment-dependent goals using the parameter values. I suspect it also makes sense to have many of the specialized processing schemata learned via optimization, with the same systemic-goal-achievement fitness function. In this case, one gives the system a multilayer architecture & data repr. to work with, but one "teaches" it (via evolutionary or probabilistic optimization methods) the processing algorithms that need to reside at each layer.
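To illustrate what I mean by tuning against a systemic-goal-achievement fitness function, here is a toy Python sketch of the outer optimization loop. The names (goal_achievement_score, NUM_PARAMS, etc.) are placeholders invented for this e-mail; in reality the "fitness evaluation" would be the whole multilayer system acting in its environment, which is of course the expensive part, and one might use probabilistic rather than evolutionary optimization.

# Toy sketch of parameter tuning via evolutionary optimization, where the
# fitness of a parameter vector is (in the real system) the degree to which
# the whole architecture achieves its goals in the environment.  Here the
# system is faked with a stand-in scoring function so the loop actually runs.

import random

NUM_PARAMS = 8          # e.g. thresholds, decay rates, filter widths at each layer
POP_SIZE = 20
GENERATIONS = 50
MUTATION_SIGMA = 0.1


def goal_achievement_score(params):
    """Stand-in for running the whole system on environment-dependent goals.

    In the real setting this would instantiate the 3-layer architecture with
    these parameter values, let it act in its environment, and measure goal
    achievement.  Here we just pretend the optimum is an arbitrary target.
    """
    target = [0.3] * NUM_PARAMS
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def mutate(params):
    # Gaussian perturbation of each parameter value.
    return [p + random.gauss(0.0, MUTATION_SIGMA) for p in params]


def evolve():
    population = [[random.random() for _ in range(NUM_PARAMS)]
                  for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        scored = sorted(population, key=goal_achievement_score, reverse=True)
        parents = scored[:POP_SIZE // 2]              # keep the better half
        children = [mutate(random.choice(parents)) for _ in range(POP_SIZE // 2)]
        population = parents + children
    return max(population, key=goal_achievement_score)


if __name__ == "__main__":
    best = evolve()
    print("best params:", [round(p, 3) for p in best])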
*****

Now, one could potentially create a super-smart AGI without any vision at all, then give it eyes, and watch it *discover* an architecture like the above on its own. This would be sort of like a human scientist creating new sensory organs for himself and then using science to create a new sensory cortex to go with them.

But given how important vision is to humans, it may well be interesting to actually build a vision lobe for an AGI system prior to the super-smart stage (i.e. prior to the human-intelligence stage). In my approach to AGI, the value of vision-lobe construction depends on the answer to the question: "Once we build an AGI with a reasonably high degree of intelligence and no visual sensors, how long does it take for this AGI to get super-smart via self-modification?" But my approach isn't vision-centric.

There are at least two possible reasons for taking a more aggressive approach to implementing vision:

1) If you believe Keith Hoyes, then visual metaphors are SUCH a powerful heuristic that one would be dumb not to incorporate vision in one's first version of an AGI system.

2) There's an alternate argument that humanlike sensation/action is so important to human language, and understanding human language is so useful to a growing AGI, that it makes sense to give a humanlike AI humanlike sensation/action as soon as possible, just to maximally facilitate its communication with humans.

I find the second argument more convincing than the first.... But still, I prefer to proceed first by creating a system without vision, and to add vision only when time/necessity seem to dictate. I would rather add *touch* and *hearing* than vision first, if I'm going to add physical sensors to Novamente.

Now, simulating touch in an AGI system is a VERY interesting issue to me, because touch ties directly into the sense of physical self that we humans have so naturally (and that develops in us during early infancy...).

And, Moshe, you hint at adding simulated physical sensors rather than actual ones. I am skeptical of this in the short run, because there is no simulated physical environment containing anywhere near the richness of the real-world perceptual environment. [The Net contains loads of data, but not so much simulated sensory data...]

-- Ben G
