On Fri, Apr 5, 2013 at 2:27 AM, Ben Goertzel <[email protected]> wrote:

>> Anyway, I would like opinions on the computational complexity of human
>> vision. Specifically, how would you optimize Google's cat face
>> recognizer and bring it up to human level?
>> http://128.84.158.119/abs/1112.6209v3
>
> I wouldn't try to optimize that algorithm; I would take a different
> approach that couples a visual hierarchy with a structurally and
> dynamically richer cognitive system...
>
> But I'm not going to try to pack the details of my AGI thinking into
> an email...
I assume it is based on DeSTIN, which is also a hierarchical neural network.
http://blog.opencog.org/2011/02/21/destin-vision-development/
http://www.aaai.org/ocs/index.php/FSS/FSS09/paper/viewFile/951/1268

I note from the 2009 paper that DeSTIN is able to distinguish between 32 x 32 x 1 images of A, B, and C, with translations and added noise, using an 8-layer network (4 feature-detection layers alternating with 4 belief layers) with layer sizes 64, 24, 16, 12, 4, 6, 1, 3, where the first layer detects 4 x 4 non-overlapping patches. I'm not sure, but I think there are about 20K connections, mostly in the lower layers. I presume a single processor is sufficient. The paper did not indicate the number of training cycles or CPU time, except to say there were 300 cycles to learn the intermediate features before training the belief nodes. The blog post from 2011 notes that a GPU port is planned. Are there any new experimental results?

It seems like the next logical step would be to model a fovea and saccades to reduce the input complexity, and then give it some harder problems like reading text, interpreting captchas, recognizing faces, or recognizing objects from ImageNet. That could be followed by adding depth perception and motion-detection features, and using it to control robot navigation.

I realize that DeSTIN and the Google system differ in size and details, but they are both hierarchical neural networks with unsupervised learning of intermediate features using winner-take-all networks or something similar. In both cases, the computational requirements depend on the size of the training set and the number of features to be detected. The Google system has 10^5 times more connections and, I would guess, 10^5 times more training data, requiring 10^10 times as much computation.

I don't know of any good estimates of the number of features in human-level vision. We know there are 10^6 inputs from the optic nerve. I suppose that we can distinguish among 10^6 visual objects at the top layer.
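To make the scaling assumption explicit, here is a trivial check. The two 10^5 ratios are my own guesses from above, and the cost model (training cost proportional to connections times training examples) is an assumption, not something from either paper:

```python
# Scaling sketch for the Google-vs-DeSTIN comparison.
# Assumed cost model: compute ~ connections x training examples.
connection_ratio = 1e5   # Google system has ~1e5x more connections (my estimate)
data_ratio = 1e5         # and, I guess, ~1e5x more training data
compute_ratio = connection_ratio * data_ratio
assert compute_ratio == 1e10   # hence ~1e10x as much computation
```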
This is somewhat higher than our vocabulary. I think it has to be larger than 10^5, because language is inadequate for describing everything we can see. I can't describe a person's face in sufficient detail that you would immediately recognize them. You would need a picture.

Let's assume there are about 10 layers, all the same size. Then there are about 10^13 connections. Over a few decades we receive 10^16 bits from the optic nerve, at a rate of 10 bits per second per nerve fiber x 10^6 fibers x 10^9 seconds. The processing rate would be 2 x 10^14 OPS at a 50 ms cycle time. That seems about right, because it takes about 0.5 seconds to recognize a face.

You need 40 TB of RAM to store 10^13 connections as 32-bit integers or floats. An NVIDIA Titan GPU has 2688 cores and runs at 4 TFLOPS (32-bit floats) with 6 GB of memory. It costs about $1000, uses 250 watts of electricity, and plugs into a slot in a desktop PC. My simple math tells me 50 of these would give you enough CPU power but leave you short on RAM by a factor of 128. You would have to augment each card with 1 TB of external memory, but the bus bandwidth would be far too slow to access all of it every 50 ms, even with a serial access pattern. Alternatively, you could put together 6000 cards for $6 million, plus the interconnect hardware and 1.5 MW of electricity. This would allow you to run experiments 128 times faster than real time, processing a decade's worth of training video in about a month. I think this would be necessary in order to develop and tune the algorithm in reasonable time.

I'm also assuming that RAM is accessed sequentially in large blocks, as is typical in fully connected neural networks implemented using vector processing. Random access through pointers or sparse networks is about 50 times slower. There might be other implementations using single bits or bytes to represent synapses to save memory. I'm not sure what the speed impact would be.

Do you agree with my math?
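Here is the arithmetic above as a runnable sanity check. Everything in it is an assumed input from this email (layer count, connection count, cycle time, 2013 Titan specs), not a measured value:

```python
# Back-of-envelope check of the estimates above.
connections = 1e13        # ~10 layers of ~1e6 units each (my estimate)
cycle = 0.05              # 50 ms per recognition cycle
ops = connections / cycle # one connection update per cycle
assert ops == 2e14        # matches the 2 x 10^14 OPS figure

ram_bytes = connections * 4    # 32-bit weights -> 4e13 bytes = 40 TB

# Optic nerve input over a few decades:
bits = 10 * 1e6 * 1e9     # 10 bits/s x 1e6 fibers x 1e9 s
assert bits == 1e16

# NVIDIA Titan (2013): ~4 TFLOPS single precision, 6 GB, ~$1000, 250 W
titan_flops, titan_ram = 4e12, 6 * 2**30
titan_cost, titan_watts = 1000, 250

cards_for_compute = ops / titan_flops
assert cards_for_compute == 50           # enough FLOPS with 50 cards...

ram_shortfall = ram_bytes / (cards_for_compute * titan_ram)
assert 120 < ram_shortfall < 135         # ...but ~128x short on RAM

cards_for_ram = ram_bytes / titan_ram    # ~6200, i.e. roughly 6000 cards
cost = 6000 * titan_cost                 # $6 million
power = 6000 * titan_watts               # 1,500,000 W = 1.5 MW
assert cost == 6_000_000 and power == 1_500_000
```

The shortfall comes out near 124 rather than exactly 128, but at this level of approximation the conclusion is the same: compute is cheap, memory capacity and bandwidth are the binding constraints.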
I realize my estimate of 10^13 connections is 1/10 that of the cortex, but I am just estimating the vision component.

--
-- Matt Mahoney, [email protected]

-------------------------------------------
AGI Archives: https://www.listbox.com/member/archive/303/=now
