On Fri, Apr 5, 2013 at 2:27 AM, Ben Goertzel <[email protected]> wrote:

>> Anyway, I would like opinions on the computational complexity of human
>> vision. Specifically, how would you optimize Google's cat face
>> recognizer and bring it up to human level?
>> http://128.84.158.119/abs/1112.6209v3
>
> I wouldn't try to optimize that algorithm; I would take a different
> approach that couples a visual hierarchy with a structurally and
> dynamically richer cognitive system...
>
> But I'm not going to try to pack the details of my AGI thinking into
> an email...
I assume it is based on DeSTIN, which is also a hierarchical neural network.
http://blog.opencog.org/2011/02/21/destin-vision-development/
http://www.aaai.org/ocs/index.php/FSS/FSS09/paper/viewFile/951/1268

I note from the 2009 paper that DeSTIN is able to distinguish between 32 x 32 x 1 images of A, B, and C, with translations and added noise, using an 8-layer network (4 feature-detection layers alternating with 4 belief layers) with layer sizes 64, 24, 16, 12, 4, 6, 1, 3, where the first layer detects 4 x 4 non-overlapping patches. I'm not sure, but I think there are about 20K connections, mostly in the lower layers. I presume a single processor is sufficient. The paper did not indicate the number of training cycles or CPU time, except to say there were 300 cycles to learn the intermediate features before training the belief nodes. The blog post from 2011 notes that a GPU port is planned. Are there any new experimental results?

It seems like the next logical step would be to model a fovea and saccades to reduce the input complexity, and then give it some harder problems like reading text, interpreting captchas, recognizing faces, or recognizing objects from ImageNet. That could be followed by adding depth perception and motion-detection features, and using it to control robot navigation.

I realize that DeSTIN and the Google system differ in size and details, but they are both hierarchical neural networks with unsupervised learning of intermediate features using winner-take-all networks or something similar. In both cases, the computational requirements depend on the size of the training set and the number of features to be detected. The Google system has 10^5 times more connections and, I would guess, 10^5 times more training data, requiring 10^10 times as much computation.

I don't know of any good estimates of the number of features in human-level vision. We know there are 10^6 inputs from the optic nerve. I suppose that we can distinguish among 10^6 visual objects at the top layer.
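To make the scaling assumption explicit, here is a trivial check. The two 10^5 ratios are my own guesses from above, and the cost model (training cost proportional to connections times training examples) is an assumption, not something from either paper:

```python
# Scaling sketch for the Google-vs-DeSTIN comparison.
# Assumed cost model: compute ~ connections x training examples.
connection_ratio = 1e5   # Google system has ~1e5x more connections (my estimate)
data_ratio = 1e5         # and, I guess, ~1e5x more training data
compute_ratio = connection_ratio * data_ratio
assert compute_ratio == 1e10   # hence ~1e10x as much computation
```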
This is somewhat higher than our vocabulary. I think it has to be larger than 10^5, because language is inadequate for describing everything we can see. I can't describe a person's face in sufficient detail that you would immediately recognize them. You would need a picture.

Let's assume there are about 10 layers, all the same size. Then there are about 10^13 connections. Over a few decades we receive 10^16 bits from the optic nerve, at a rate of 10 bits per second per nerve fiber x 10^6 fibers x 10^9 seconds. The processing rate would be 2 x 10^14 OPS at a 50 ms cycle time. That seems about right, because it takes about 0.5 seconds to recognize a face.

You need 40 TB of RAM to store 10^13 connections as 32-bit integers or floats. An NVIDIA Titan GPU has 2688 cores and runs at 4 TFLOPS (32-bit floats) with 6 GB of memory. It costs about $1000, uses 250 watts of electricity, and plugs into a slot in a desktop PC. My simple math tells me 50 of these would give you enough CPU power but leave you short on RAM by a factor of 128. You would have to augment each card with 1 TB of external memory, but the bus bandwidth would be far too slow to access all of it every 50 ms, even with a serial access pattern. Alternatively, you could put together 6000 cards for $6 million, plus the interconnect hardware and 1.5 MW of electricity. This would allow you to run experiments 128 times faster than real time, processing a decade's worth of training video in about a month. I think this would be necessary in order to develop and tune the algorithm in reasonable time.

I'm also assuming that RAM is accessed sequentially in large blocks, as is typical in fully connected neural networks implemented using vector processing. Random access through pointers or sparse networks is about 50 times slower. There might be other implementations using single bits or bytes to represent synapses to save memory. I'm not sure what the speed impact would be.

Do you agree with my math?
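Here is the arithmetic above as a runnable sanity check. Everything in it is an assumed input from this email (layer count, connection count, cycle time, 2013 Titan specs), not a measured value:

```python
# Back-of-envelope check of the estimates above.
connections = 1e13        # ~10 layers of ~1e6 units each (my estimate)
cycle = 0.05              # 50 ms per recognition cycle
ops = connections / cycle # one connection update per cycle
assert ops == 2e14        # matches the 2 x 10^14 OPS figure

ram_bytes = connections * 4    # 32-bit weights -> 4e13 bytes = 40 TB

# Optic nerve input over a few decades:
bits = 10 * 1e6 * 1e9     # 10 bits/s x 1e6 fibers x 1e9 s
assert bits == 1e16

# NVIDIA Titan (2013): ~4 TFLOPS single precision, 6 GB, ~$1000, 250 W
titan_flops, titan_ram = 4e12, 6 * 2**30
titan_cost, titan_watts = 1000, 250

cards_for_compute = ops / titan_flops
assert cards_for_compute == 50           # enough FLOPS with 50 cards...

ram_shortfall = ram_bytes / (cards_for_compute * titan_ram)
assert 120 < ram_shortfall < 135         # ...but ~128x short on RAM

cards_for_ram = ram_bytes / titan_ram    # ~6200, i.e. roughly 6000 cards
cost = 6000 * titan_cost                 # $6 million
power = 6000 * titan_watts               # 1,500,000 W = 1.5 MW
assert cost == 6_000_000 and power == 1_500_000
```

The shortfall comes out near 124 rather than exactly 128, but at this level of approximation the conclusion is the same: compute is cheap, memory capacity and bandwidth are the binding constraints.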
I realize my estimate of 10^13 connections is 1/10 that of the cortex, but I am just estimating the vision component.

--
-- Matt Mahoney, [email protected]

-------------------------------------------
AGI Archives: https://www.listbox.com/member/archive/303/=now
