I think processor-to-memory and inter-processor communications are currently far short of what is needed.
-----Original Message-----
From: Matt Mahoney [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 12, 2008 12:33 PM
To: [email protected]
Subject: RE: [agi] IBM, Los Alamos scientists claim fastest computer

Matt Mahoney >##########>> I think the ratio of processing power to memory to bandwidth is just about right for AGI.

Ed Porter >##########>> I tend to think otherwise. I think the current processor-to-RAM and processor-to-processor bandwidths are too low. (PLEASE CORRECT ME IF YOU THINK ANY OF MY CALCULATIONS OR STATEMENTS BELOW ARE INCORRECT)

The average synapse fires roughly once per second. The brain has roughly 10^12 - 10^15 synapses (the lower figure is based on some people's claim that only 1% of synapses are really effective). Since each synapse activation involves at least two memory accesses (at minimum a read-modify-write), that would produce roughly a similar number of memory accesses per second. Because of the high degree of irregularity and non-locality of connections in the brain, many such accesses would have to be modeled by non-sequential RAM accesses. Since --- as is stated below in more detail --- a current processor can only average roughly 10^7 non-sequential read-modify-writes per second, that means 10^5 - 10^8 processors would be required just to access RAM at the same rate the brain accesses memory at its synapses, with 10^5 probably being a low number.

But a significant fraction of the equivalent of synapse activations would require inter-processor communication in an AGI made out of current computer hardware. If one has only on the order of 10^5 processors, load balancing becomes an issue, and to minimize it you actually want a fair amount of non-locality of memory. (For example, when they put Shastri's Shruti cognitive architecture on a Thinking Machine, they purposely randomized the distribution of data across the machine's memory to promote load balancing.)
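The processor-count estimate above can be reproduced with a quick back-of-the-envelope script. This is only a sketch using the figures assumed in the email (10^12 to 10^15 synapses firing ~once per second, and ~10^7 non-sequential read-modify-writes per second per processor), not measured numbers:

```python
# Back-of-envelope check of the synapse-access argument above.
# Assumptions (from the email): 1e12..1e15 synapses, each firing
# ~1/sec, each activation counted as one non-sequential
# read-modify-write, and ~1e7 such RMWs/sec per current processor.

def processors_needed(synapses, fires_per_sec=1.0, rmw_per_sec_per_cpu=1e7):
    """Processors required to match the brain's synaptic memory traffic."""
    accesses_per_sec = synapses * fires_per_sec
    return accesses_per_sec / rmw_per_sec_per_cpu

low = processors_needed(1e12)   # effective-synapse lower bound
high = processors_needed(1e15)  # full synapse count
print(f"{low:.0e} to {high:.0e} processors")  # 1e+05 to 1e+08
```

This matches the 10^5 - 10^8 range claimed above; doubling the per-activation access count (a full separate read and write) would double the processor counts accordingly.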
(Load balancing is not an issue in the brain, since the brain has the equivalent of a simple, but parallel, processor for roughly every 100 to 10K synapses.)

Thus, you are probably talking in terms of needing to send something in the rough ballpark of 10^9 to 10^12 short inter-processor messages a second. To do this without congestion problems, you are probably going to need a theoretical bandwidth 5 to 10 times that.

One piece of hardware that would be a great machine to run test AGIs on is the roughly $60M TACC Ranger supercomputer in Austin, TX. It includes 15,700 AMD quad-cores, for over 63K cores, and about 100TB of RAM. Most importantly, it has Sun's very powerful Constellation system switch with 3,456 (an easy-to-remember number) 20Gbit bi-directional InfiniBand ports, which is a theoretical cross-sectional bandwidth of roughly 6.9TByte/sec. If the average spreading-activation message were 32 bytes, if messages were packed into larger blocks to reduce per-message costs, and if you assumed only roughly 10 percent of the total capacity was used on average to prevent congestion, that would allow roughly 20 billion global messages a second, with each of the 3,456 roughly quad-processor quad-core nodes receiving about 5 million per second. (If anybody has any info on how many random memory accesses a quad-processor quad-core node can do per second, I would be very interested --- I am guessing between 80 and 320 million/sec.)

I would not be surprised if the Ranger's inter-processor and processor-to-RAM bandwidth is one or two orders of magnitude too low for many types of human-level thinking, but it would certainly be enough to do very valuable AGI research, and to build powerful intelligences that would be in many ways more powerful than human.

Matt Mahoney >##########>> Processing power and memory increase at about the same rate under Moore's Law.
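The Ranger message-rate figures above can be checked with a short calculation, taking the email's own assumptions at face value (6.9 TB/s theoretical cross-sectional bandwidth, 32-byte messages, 10% average utilization):

```python
# Sanity check of the Ranger message-rate estimate above.
# Assumptions (from the email): ~6.9 TB/s theoretical cross-sectional
# bandwidth on the Constellation switch, 32-byte spreading-activation
# messages, and only 10% average utilization to avoid congestion.

BISECTION_BW_BYTES = 6.9e12   # bytes/sec, cross-sectional
MSG_BYTES = 32                # per spreading-activation message
UTILIZATION = 0.10            # headroom to prevent congestion
NODES = 3456

global_msgs_per_sec = BISECTION_BW_BYTES * UTILIZATION / MSG_BYTES
per_node = global_msgs_per_sec / NODES
print(f"{global_msgs_per_sec:.2e} msgs/sec, {per_node:.2e} per node")
# ~2.2e10 global (the 'roughly 20 billion' above), ~6e6 per node,
# i.e. on the order of the 'about 5 million per second' quoted.
```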
Ed Porter >##########>> Yes, but the frequency of non-sequential processor-to-memory accesses has increased much more slowly. (This may change in the future with the kind of massively multi-core chips Sam Adams says he is now working on: chips with built-in high-bandwidth mesh networks and, say, 10 RAM layers over each processor, the layers connected by through-silicon vias. If each such multi-layer chip were in turn connected by hundreds of high-bandwidth communication channels, that could help change this. So could processor-in-memory chips.)

Matt Mahoney >##########>> The time it takes a modern computer to clear all of its memory is on the same order as the response time of a neuron, and this has not changed much since ENIAC and the Commodore 64. It would seem easier to increase processing density than memory density, but we are constrained by power consumption, heat dissipation, network bandwidth, and the lack of software and algorithms for parallel computation. Bandwidth is about right too. A modern PC can simulate about 1 mm^3 of brain tissue with 10^9 synapses at 0.1 ms resolution or so.

Ed Porter >##########>> I think a PC is too slow to do the modeling you describe, because of processor-to-RAM latencies. As of 1999, a 730 MHz PC could only non-sequentially access RAM at about 10 MHz. Today, I think the latencies are still at least 30 ns, and the transfer of a 64-byte cache line on a 64-bit-wide bus would take 8 clock edges, or four 2.5 ns cycles of a 400 MHz bus clock, for another 10 ns. So the limit would probably still be under 25M random accesses per second today. There are other delays --- such as the time required (1) to find out whether the desired memory is in L1 and then L2 cache, (2) to do a virtual address translation, and (3) to pass the memory request and then the returned data through the chip's bus interface --- which I have not included.
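The latency arithmetic above can be sketched as a quick calculation (a sketch using only the email's assumed figures: ~30 ns DRAM latency, a 64-byte line over a 64-bit DDR bus at 400 MHz):

```python
# The random-access latency arithmetic above, as a quick calculation.
# Assumptions (from the email): ~30 ns DRAM latency, 64-byte cache line
# over a 64-bit (8-byte) bus, DDR transfer (2 edges/cycle) at 400 MHz.

LATENCY_NS = 30.0
LINE_BYTES = 64
BUS_BYTES = 8                      # 64-bit-wide bus
EDGES = LINE_BYTES // BUS_BYTES    # 8 clock edges for the whole line
CYCLES = EDGES / 2                 # DDR: 2 edges per cycle -> 4 cycles
CYCLE_NS = 1e9 / 400e6             # 2.5 ns at 400 MHz

transfer_ns = CYCLES * CYCLE_NS              # 10 ns
accesses_per_sec = 1e9 / (LATENCY_NS + transfer_ns)
print(f"{accesses_per_sec:.1e} random accesses/sec")
# 2.5e7 -- the 'under 25M accesses per second' ceiling above
```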
Multi-core or hyperthreading might produce some improvement by overlapping the latencies, but at best you would not be exceeding 40M random memory accesses per second. (If anyone has hard figures on these numbers, please tell me.) Since you are assuming modeling of 10^9 synapses, and since the connections are highly irregular, most of your synapse updating would be done as non-sequential RAM accesses. Since such an update would require at least a read and a write, you could only do roughly 1 to 2 x 10^7 per second. This is optimistic, and it is close to two orders of magnitude slower than what you have said above.

Matt Mahoney >##########>> Nerve fibers have a diameter around 1 or 2 microns, so a 1 mm cube would have about 10^6 of these transmitting 10 bits per second, or 10 Mb/s.

Ed Porter >##########>> I will assume your calculation is correct. But since the cortex is roughly 1600 cm^2 in area, there would be roughly 160,000 of these 1 mm^3 cubes (and corresponding PCs in your calculation), each with your above-stated 10 Mb/s of bandwidth, for a total inter-cube bandwidth of 1.6 x 10^12 b/s. It should be noted that since each cube would be attached to many other cubes, the bandwidth required if such interconnect were modeled by a simple mesh would be even higher.

But as I have said above, you would really need about 50 to 100 interconnected PCs to have the memory-accessing capability of one 1 mm^3 cube, and you would need a roughly similar density of interconnect between them, arguably multiplying the total inter-processor interconnect fifty- to a hundred-fold, to roughly 8 x 10^13 to 1.6 x 10^14 b/s.

Plus, it must be remembered that much of the information contained in a synapse or nerve fiber is conveyed by the location, strength, and type (such as neurotransmitter) of its connections. This is equivalent to probably at least a 4- or 8-byte address associated with every message, plus at least a byte or so for strength and connection type.
If you have one message a second, that is an extra 40 or 80 bits added to the 10 bits/sec per nerve fiber mentioned above.

Matt Mahoney >##########>> Similar calculations for larger cubes show locality with bandwidth growing at O(n^2/3). This could be handled by an Ethernet cluster with a high-speed core using off-the-shelf hardware.

Ed Porter >##########>> But as stated above, the brain has very fine-grained processing power and a high ratio of processing units to memory, so it can take advantage of locality of memory without the load-balancing problems most current computer-hardware equivalents would have; these locality-of-reference calculations would probably have to be altered accordingly.

Matt Mahoney >##########>> I don't know if it is coincidence that these 3 technologies are in the right ratio, or if it is driven by the needs of software that complements the human mind. -- Matt Mahoney, [EMAIL PROTECTED]

Ed Porter >##########>> NET-NET: There are possible savings in the amount of hardware required to emulate human-level thought, resulting from certain ways in which computer hardware greatly outperforms wetware --- such as much crisper memory, the ability to perform long sequences of operations with much greater exactness, a greater ability to rapidly load and save memory, and the ability to rapidly change function through programmability. But except for the possibility of such savings, it would appear that the amount of current-style computer hardware required to produce human-level thought is substantially greater than your above estimate indicates. The human brain has been brilliantly designed by evolution to perform massive spreading-activation computing in an extremely efficient manner.
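The inter-cube bandwidth figures in the exchange above can be reproduced with a short calculation, again taking the emails' own assumptions (1600 cm^2 of cortex modeled as 1 mm^3 cubes at 10 Mb/s each, and 50-100 PCs per cube):

```python
# Reproducing the inter-cube bandwidth estimate from the exchange above.
# Assumptions (from the emails): cortex area ~1600 cm^2, modeled as
# 1 mm^3 cubes at 10 Mb/s each (Mahoney's figure), and Porter's 50-100
# PCs needed per cube to match its random-memory-access rate.

CORTEX_CM2 = 1600
CUBES = CORTEX_CM2 * 100          # 100 mm^2 per cm^2 -> 160,000 cubes
BITS_PER_CUBE = 10e6              # 10 Mb/s per 1 mm^3 cube

total_bps = CUBES * BITS_PER_CUBE            # 1.6e12 b/s between cubes
low, high = total_bps * 50, total_bps * 100  # 50-100 PCs per cube
print(f"{total_bps:.1e} b/s; with 50-100 PCs/cube: {low:.1e} - {high:.1e} b/s")
# 1.6e12 b/s base; 8e13 - 1.6e14 b/s with the PC multiplier, as above
```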
--- On Thu, 6/12/08, Derek Zahn <[EMAIL PROTECTED]> wrote:

From: Derek Zahn <[EMAIL PROTECTED]>
Subject: RE: [agi] IBM, Los Alamos scientists claim fastest computer
To: [email protected]
Date: Thursday, June 12, 2008, 11:36 AM

Two things I think are interesting about these trends in high-performance commodity hardware:

1) The "flops/bit" ratio (processing power vs. memory) is skyrocketing. The move to parallel architectures makes the number of high-level "operations" per transistor go up, but bits of memory per transistor in large memory circuits don't go up. The old "bit per op/s" or "byte per op/s" rules of thumb get really broken on things like Tesla (0.03 bit/flops). Of course we don't know the ratio needed for de novo AGI or brain modeling, but the assumptions about processing vs. memory certainly seem to be changing.

2) Much more than previously, effective utilization of processor operations requires incredibly high locality (processing cores only have immediate access to very small memories). This is also referred to as "arithmetic intensity". This is of course because parallelism causes "operations per second" to expand much faster than methods for increasing memory bandwidth to large banks. Perhaps future 3D layering techniques will help with this problem, but for now AGI paradigms hoping to cache in (yuk yuk) on these hyper-increases in FLOPS need to be geared to high arithmetic intensity.

Interestingly (to me), these two things both imply that we get to increase the complexity of neuron and synapse models beyond the "muladd/synapse + simple activation function" model with essentially no degradation in performance, since the bandwidth of propagating values between neurons is the bottleneck much more than local processing inside the neuron model.
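The 0.03 bit/flops figure for Tesla above can be illustrated with rough numbers. The memory and FLOPS values below are assumed round figures for a 2008-era Tesla-class card, not exact specs for any particular model:

```python
# Illustrating the flops/bit point above with assumed round numbers
# (~1.5 GB of device memory, ~400 GFLOPS peak; hypothetical figures,
# not exact specs for any specific Tesla model).

MEM_BITS = 1.5e9 * 8     # ~1.5 GB of device memory, in bits
PEAK_FLOPS = 400e9       # ~400 GFLOPS peak

ratio = MEM_BITS / PEAK_FLOPS
print(f"{ratio:.2f} bit/flops")
# ~0.03 bit/flops, far below the old 'bit per op/s' rule of thumb
```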
-------------------------------------------
agi
Archives: http://www.listbox.com/member/archive/303/=now
RSS Feed: http://www.listbox.com/member/archive/rss/303/
Modify Your Subscription: http://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com
