RE: [agi] IBM, Los Alamos scientists claim fastest computer

2008-06-14 Thread Ed Porter
Matt,  

As acknowledged before, your notion of using a single matrix to represent a
neural net with, say, 10^5 neurons, each with up to 10^4 connections, has the
benefit of allowing much more efficient processing: the rows could be read
serially, which allows much faster memory access; the output from processing
each row that is contributed to column values could be stored largely in
cache, allowing relatively rapid storage even if the representations of
contacts are relatively sparse along each row; and the equivalent of address
representations of contacts could be represented much more efficiently.  I
had known neural nets were often represented by matrices, but it is still not
totally clear to me that it would work well with the type of system I am
interested in.  But it might.
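
For concreteness, here is a minimal NumPy sketch of that row-serial scheme
(illustrative sizes only, not code from this thread; the 10^5 x 10^5 case
discussed here would need tens of GB stored densely at a byte or two per
weight):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000                                   # illustrative; not the thread's 10^5

    # Row i holds neuron i's outgoing weights (mostly zero, but stored densely).
    W = rng.random((n, n), dtype=np.float32) * (rng.random((n, n)) < 0.1)

    fired = rng.random(n) < 0.05               # presynaptic neurons that fired this cycle
    activation = np.zeros(n, dtype=np.float32)

    # Row-serial pass: each firing neuron's row streams sequentially from memory,
    # and its contributions accumulate into column (postsynaptic) totals that are
    # small enough to stay cache-resident.
    for i in np.flatnonzero(fired):
        activation += W[i]

    fires_next = activation > 5.0              # arbitrary threshold for the next cycle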


-Original Message-
From: Matt Mahoney [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 13, 2008 2:30 PM
To: agi@v2.listbox.com
Subject: RE: [agi] IBM, Los Alamos scientists claim fastest computer


With regard to representing different types of synapses (various time
delays, strength bounds, learning rates, etc), this information can be
recorded as characteristics of the input and output neurons and derived as
needed to save space.

[Ed Porter] I still think you are going to need multi-bit weights at each
row-column element in the matrix -- since almost all representations of
synapses I have seen have assumed a weight having at least 6 bits of
information, and there is reason to think you need to store both a short-term
and a long-term value, since that seems necessary for the temporal-correlation
component of Hebbian learning and to represent the state information that is
short-term memory.

[Ed Porter] I don't understand how this information could be stored or
accessed more efficiently than in row-column matrix elements, so that it
would be available when computing a row's contribution to each successive
column.  This would include any information such as whether the synapse is
excitatory or inhibitory, its delay, strength bounds, and learning rate,
unless you assume such information is uniform in each row or each column.
(In the brain, this may be the case for some attributes -- such as, I think,
having separate neurons for inhibition and excitation -- but it would appear
to require added complexity for many learning tasks.)

Minimizing inter-processor communication is a harder problem. This can be
done by mapping the neural network into a hierarchical organization so that
groups of co-located neurons are forced to communicate with other groups
through narrow channels using a small number of neurons. We know that many
problems can be solved this way. For example, a semantic language model made
of a 20K by 20K word association matrix can be represented using singular
value decomposition as a 3 layer neural network with about 100 to 200 hidden
neurons [1,2]. The two weight matrices could then be implemented on separate
processors which communicate through the hidden layer neurons. 

 
[Ed Porter] Having matrices for connecting the matrices makes sense.  But my
understanding is that SVD is often a form of lossy compression.

[Ed Porter] As I understand it, SVD is the non-square-matrix equivalent of
finding the largest eigenvectors of a matrix, and it involves discounting the
information contained in the less important (smaller eigenvalue) dimensions.
I assume an AGI could afford a certain amount of such error due to loss of
detail for tasks similar to the 20K word example, where the goal was to
provide a simplified representation of bi-grams, which are just statistical
correlations in ordered word pairs, but it might not be appropriate for
representing episodic memory, where crisper, less statistical information
matters.

[Ed Porter] Also, it would seem that dynamically learning new patterns would
present problems with a system in which the interconnects between matrices
have been determined by such an SVD process.  Perhaps there would be a
separate system for learning new information, and for information that
represents important exceptions to the SVD compression.

More generally, we know from chaos theory that complex systems must limit
the number of interconnections to be stable [3], which suggests that many AI
problems in general can be decomposed this way.

Remember we need not model the human brain in precise detail, since our goal
is to solve AGI by any means. We are allowed to use more efficient
algorithms if we discover them.

I ran some benchmarks on my PC (2.2 GHz Athlon-64, 3500+). It copies large
arrays at 1 GB per second using MMX or SSE2, which is not quite fast enough
for a 10^5 by 10^5 neural network simulation.

[Ed Porter] I assume this is on an uncompressed matrix. I don't know what
the overhead would be to compress the matrix, such as by representing all
the elements that are empty with run-length encoding, and then trying to
process it.  Presumably to use MMX or SSE2 you would have to load some of the
compressed matrix into cache, de-compress it, then run it as a block through
the MMX or SSE2.

RE: [agi] IBM, Los Alamos scientists claim fastest computer

2008-06-14 Thread Matt Mahoney
--- On Sat, 6/14/08, Ed Porter [EMAIL PROTECTED] wrote:
 [Ed Porter] I still think you are going to need multi-bit weights at each
 row-column element in the matrix -- since almost all representations of
 synapses I have seen have assumed a weight having at least 6 bits of
 information, and there is reason to think you need to store both a
 short-term and a long-term value, since that seems necessary for the
 temporal-correlation component of Hebbian learning and to represent the
 state information that is short-term memory.

There is a tradeoff between using a larger number of neurons (redundant 
representation of concepts) and a more precise representation of synaptic 
weights. There is also a tradeoff between representing the exact properties of 
synapses and approximating them from the properties of the neurons they connect.

 [Ed Porter] Having matrices for connecting the matrices makes sense.  But
 my understanding is that SVD is often a form of lossy compression.

That is true. When used to compress semantic relationships, it implements the 
transitive property. For example, if a word-word matrix learns the 
relationships rain-wet and wet-water, SVD will infer rain-water even if it was 
not seen in the training corpus.
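
A toy NumPy illustration of that transitive inference (made-up counts, not
from the original post; a rank-1 reconstruction of a 4-word co-occurrence
matrix fills in the unseen rain-water cell):

    import numpy as np

    words = ["rain", "wet", "water", "dry"]
    # Toy co-occurrence counts: rain-wet and wet-water were seen, rain-water was not.
    C = np.array([[0, 3, 0, 0],
                  [3, 0, 3, 0],
                  [0, 3, 0, 0],
                  [0, 0, 0, 1]], dtype=float)

    U, s, Vt = np.linalg.svd(C)
    k = 1                                     # keep only the largest singular value
    C_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    print(C_hat[words.index("rain"), words.index("water")])   # > 0 after rank-1 reconstruction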

SVD (or equivalently, a 3 layer neural network) could also be used to compress 
a mapping of pixels to characters for OCR. The hidden neurons (or largest 
eigenvalues) would represent intermediate features like line segments.

 I ran some benchmarks on my PC (2.2 GHz Athlon-64, 3500+). It copies
 large arrays at 1 GB per second using MMX or SSE2, which is not
 quite fast enough for a 10^5 by 10^5 neural network simulation.
 
 [Ed Porter] I assume this is on an uncompressed matrix. I don't know
 what the overhead would be to compress the matrix, such as by
 representing all the elements that are empty with run-length 
 encoding, and then trying to process it.  Presumably to use mmx or
 sse2 you would have to load some of the compressed matrix into
 cache, de-compress it, then run it as a block through the mmx or sse2.

It is probably not worth compressing a matrix with 10% density because of the 
time needed to decompress it. Decompressing runs of zeros is inefficient on a 
vector processor.

SSE2 has prefetch instructions so that matrix elements can already be in cache 
when the processor is ready to use them. However, modern processors usually 
detect sequential memory access and prefetch automatically.

-- Matt Mahoney, [EMAIL PROTECTED]







RE: [agi] IBM, Los Alamos scientists claim fastest computer

2008-06-13 Thread Ed Porter
Matt,

Thank you for your reply.  For me it is very thought provoking.

-Original Message-
From: Matt Mahoney [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 12, 2008 7:23 PM
To: agi@v2.listbox.com
Subject: RE: [agi] IBM, Los Alamos scientists claim fastest computer

--- On Thu, 6/12/08, Ed Porter [EMAIL PROTECTED] wrote:

 I think processor to memory, and inter processor
 communications are currently far short

Your concern is over the added cost of implementing a sparsely connected
network, which slows memory access and requires more memory for
representation (e.g. pointers in addition to a weight matrix). We can
alleviate much of the problem by using connection locality.
[Ed Porter] This would certainly be true if it worked.


The brain has about 10^11 neurons with 10^4 synapses per neuron. If we
divide this work among 10^6 processors, each representing 1 mm^3 of brain
tissue, then each processor must implement 10^5 neurons and 10^9 synapses.
By my earlier argument, there can be at most 10^6 external connections
assuming 1-2 micron nerve fiber diameter, 

[Ed Porter] -- Why couldn't each of the 10^6 fibers have multiple
connections along its length within the cm^3 (although it could be
represented as one row in the matrix, with individual connections
represented as elements in such a row)


so half of the connections must be local. This is true at any scale because
when you double the size of a cube, you increase the number of neurons by 8
but increase the number of external connections by 4. Thus, for any size
cube, half of the external connections are to neighboring cubes and half are
to more distant cubes.

[Ed Porter] -- I am getting lost here.  Why are half the connections local?
You implied there are 10^6 external connections in the cm^3, and 10^9
synapses, which are the connections.  Thus the 10^6 external connections you
mention are only 1/1000 of the 10^9 total connections you mention in the
cm^3, not one half as you say.  I understand that there are likely to be as
many connections leaving the cube as going into it, which is related, but not
the same thing as saying half the connections in the cm^3 are external.

[Ed Porter] -- It is true that at each doubling in scale the surface-to-volume
ratio changes by the same factor (1/2), but the actual ratio of surface to
volume decreases by half at each such doubling, meaning the ratio actually
DOES CHANGE with scaling, rather than remaining constant, as indicated above.

A 1 mm^3 cube can be implemented as a fully connected 10^5 by 10^5 matrix of
10^10 connections. This could be implemented as a 1.25 GB array of bits with
5% of bits set to 1 representing a connection. 

[Ed Porter] A synapse would have multiple weights, such as short-term and
long-term weights, and each would be more than one bit.  Plus some synapses
are excitatory and others are inhibitory, so they would have differing signs.
So multiple bits, probably at least two bytes, would be necessary per element
in the matrix.

[Ed Porter] -- Also you haven't explained how you efficiently do the
activation between cubes (I assume it would be by having a row for each
neuron that projects axons into the cube, and a column for each neuron that
projects a dendrite into it).  This could still be represented by the
matrix, but it would tend to increase its sparseness.

[Ed Porter] -- Learning changes in which dendrites and axons project into
a cube would require changing the matrix, which is doable, but can make
things more complicated. Another issue is how many other cubes each cube
would communicate with: are we talking 10, 100, 10^3, 10^4, 10^5, or 10^6?
The number could have a significant impact on communication costs.

[Ed Porter] -- I don't think this system would be good for my current model
for AGI representation, which is based on a type of graph matching, rather
than just a simple summing of synaptic inputs.

The internal computation bottleneck is the vector product which would be
implemented using 128 bit AND instructions in SSE2 at full serial memory
bandwidth. External communication is at most one bit per connected neuron
every cycle (20-100 ms), because the connectivity graph does not change
rapidly. A randomly connected sparse network could be described compactly
using hash functions.

[Ed Porter] -- It is interesting to think that this actually could be used
to speed up the processing of simple neural models.  I understand how the
row values associated with the axon synapses of a given neuron could be read
rapidly in a serial manner, and how run-length encoding, or some other means,
could be used to represent a sparse matrix more compactly.  I also understand
how the contributions each row makes to the activation of each of the 10^5
columns could be stored in L2 cache at a rate of about 100 MHz.

[Ed Porter] -- L2 cache writes commonly take about 10 to 20 clock cycles ---
perhaps you could write them into memory blocks in L1 cache, which might
only take about two

RE: [agi] IBM, Los Alamos scientists claim fastest computer

2008-06-13 Thread Matt Mahoney
--- On Fri, 6/13/08, Ed Porter [EMAIL PROTECTED] wrote:
 [Ed Porter] -- Why couldn't each of the 10^6 fibers
 have multiple connections along its length within the cm^3 (although it
 could be represented as one row in the matrix, with individual
 connections represented as elements in such a row)

I think you mean 10^6 fibers in 1 cubic millimeter. They would have multiple 
connections, but I am only counting interprocessor communication, which is 1 
bit to transmit the state of the neuron (on or off) or a few bits to transmit 
its activation level to neighboring processors.

With regard to representing different types of synapses (various time delays, 
strength bounds, learning rates, etc), this information can be recorded as 
characteristics of the input and output neurons and derived as needed to save 
space.

Minimizing inter-processor communication is a harder problem. This can be done 
by mapping the neural network into a hierarchical organization so that groups 
of co-located neurons are forced to communicate with other groups through 
narrow channels using a small number of neurons. We know that many problems can 
be solved this way. For example, a semantic language model made of a 20K by 20K 
word association matrix can be represented using singular value decomposition 
as a 3 layer neural network with about 100 to 200 hidden neurons [1,2]. The two 
weight matrices could then be implemented on separate processors which 
communicate through the hidden layer neurons. More generally, we know from 
chaos theory that complex systems must limit the number of interconnections to 
be stable [3], which suggests that many AI problems in general can be 
decomposed this way.
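
A minimal NumPy sketch of that factorization (scaled-down sizes standing in
for the 20K words and the 100-200 hidden units; the two factor matrices play
the roles of the two weight layers, and only the k hidden activations would
need to cross between processors):

    import numpy as np

    rng = np.random.default_rng(0)
    n_words, k = 200, 20                      # stand-ins for 20,000 words and 100-200 hidden units
    A = rng.poisson(0.5, size=(n_words, n_words)).astype(float)   # toy word-word counts

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    W1 = U[:, :k] * s[:k]                     # input layer  -> k hidden neurons (one processor)
    W2 = Vt[:k, :]                            # hidden layer -> output layer    (another processor)

    x = np.zeros(n_words)
    x[3] = 1.0                                # one-hot input: word number 3
    hidden = x @ W1                           # only these k values cross between processors
    y = hidden @ W2                           # approximate association profile of word 3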

Remember we need not model the human brain in precise detail, since our goal is 
to solve AGI by any means. We are allowed to use more efficient algorithms if 
we discover them.

I ran some benchmarks on my PC (2.2 GHz Athlon-64, 3500+). It copies large 
arrays at 1 GB per second using MMX or SSE2, which is not quite fast enough for 
a 10^5 by 10^5 neural network simulation.
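
A quick check of why 1 GB/s falls short, using the 1.25 GB bit-matrix size and
the 20-100 ms cycle times from this thread:

    # Required streaming rate for a 10^5 x 10^5 bit matrix at brain-like cycle times.
    matrix_bytes = 10**5 * 10**5 / 8              # 1.25e9 bytes
    for cycle_s in (0.02, 0.1):                   # 20-100 ms per update cycle
        print(cycle_s, "s cycle ->", matrix_bytes / cycle_s / 1e9, "GB/s needed")
    # vs. roughly 1 GB/s measured, i.e. one to two orders of magnitude short.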

1. Bellegarda, Jerome R., John W. Butzberger, Yen-Lu Chow, Noah B. Coccaro, 
Devang Naik (1996), “A novel word clustering algorithm based on latent semantic 
analysis”, Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 
vol. 1, 172-175.

2. Gorrell, Genevieve (2006), “Generalized Hebbian Algorithm for Incremental 
Singular Value Decomposition in Natural Language Processing”, Proceedings of 
EACL 2006, Trento, Italy.
http://www.aclweb.org/anthology-new/E/E06/E06-1013.pdf

3. Kauffman, Stuart A. (1991), “Antichaos and Adaptation”, Scientific American, 
Aug. 1991, p. 64.


-- Matt Mahoney, [EMAIL PROTECTED]





Re: [agi] IBM, Los Alamos scientists claim fastest computer

2008-06-12 Thread Brad Paulsen
If anyone is interested, I have some additional information on the C870 
NVIDIA Tesla card.  I'll be happy to send it to you off-list.  Just 
contact me directly.


Cheers,

Brad




Re: [agi] IBM, Los Alamos scientists claim fastest computer

2008-06-12 Thread Matt Mahoney
--- On Wed, 6/11/08, J Storrs Hall, PhD [EMAIL PROTECTED] wrote:

 Hmmph.  I offer to build anyone who wants one a
 human-capacity machine for 
 $100K, using currently available stock parts, in one rack.
 Approx 10  teraflops, using Teslas.
 (http://www.nvidia.com/object/tesla_c870.html)
 
 The software needs a little work...

Um, that's 10 petaflops, not 10 teraflops. I'm assuming a neural network with 
10^15 synapses (about 1 or 2 bytes each) with 20 to 100 ms resolution, 10^16 to 
10^17 operations per second.  One Tesla = 350 GFLOPS, 1.5 GB, 120W, $1.3K.  So 
maybe $1 billion and 100 MW of power for a few hundred thousand of these plus 
glue.
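
The arithmetic behind that estimate, spelled out (synapse count, resolution,
and C870 specs as given above; the $1 billion and 100 MW figures presumably
also cover interconnect, storage, cooling, and the other "glue"):

    ops_per_s = 1e17                     # upper end of the 1e16-1e17 estimate above
    tesla_flops, tesla_cost, tesla_watts = 350e9, 1.3e3, 120.0
    n_cards = ops_per_s / tesla_flops    # roughly 286,000 cards ("a few hundred thousand")
    print(f"{n_cards:,.0f} cards, ${n_cards * tesla_cost / 1e9:.2f}B and "
          f"{n_cards * tesla_watts / 1e6:.0f} MW for the cards alone")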


-- Matt Mahoney, [EMAIL PROTECTED]







Re: [agi] IBM, Los Alamos scientists claim fastest computer

2008-06-12 Thread J Storrs Hall, PhD
Right. You're talking Kurzweil HEPP and I'm talking Moravec HEPP (and shading 
that a little). 

I may want your gadget when I go to upload, though.

Josh

On Thursday 12 June 2008 10:59:51 am, Matt Mahoney wrote:
 --- On Wed, 6/11/08, J Storrs Hall, PhD [EMAIL PROTECTED] wrote:
 
  Hmmph.  I offer to build anyone who wants one a
  human-capacity machine for 
  $100K, using currently available stock parts, in one rack.
  Approx 10  teraflops, using Teslas.
  (http://www.nvidia.com/object/tesla_c870.html)
  
  The software needs a little work...
 
 Um, that's 10 petaflops, not 10 teraflops. I'm assuming a neural network 
with 10^15 synapses (about 1 or 2 byte each) with 20 to 100 ms resolution, 
10^16 to 10^17 operations per second.  One Tesla = 350 GFLOPS, 1.5 GB, 120W, 
$1.3K.  So maybe $1 billion and 100 MW of power for a few hundred thousand of 
these plus glue.
 
 
 -- Matt Mahoney, [EMAIL PROTECTED]
 
 
 
 
 






RE: [agi] IBM, Los Alamos scientists claim fastest computer

2008-06-12 Thread Derek Zahn
 Teslas

Two things I think are interesting about these trends in high-performance 
commodity hardware:
 
1) The flops/bit ratio (processing power vs memory) is skyrocketing.  The 
move to parallel architectures makes the number of high-level operations per 
transistor go up, but bits of memory per transistor in large memory circuits 
doesn't go up.  The old bit per op/s or byte per op/s rules of thumb get 
really broken on things like Tesla (0.03 bit/flops; see the quick arithmetic 
sketch at the end of this message).  Of course we don't know 
the ratio needed for de novo AGI or brain modeling, but the assumptions about 
processing vs memory certainly seem to be changing.
 
2) Much more than previously, effective utilization of processor operations 
requires incredibly high locality (processing cores only have immediate access 
to very small memories).  This is also referred to as arithmetic intensity.  
This of course is because parallelism causes operations per second to expand 
much faster than methods for increasing memory bandwidth to large banks.  
Perhaps future 3D layering techniques will help with this problem, but for now 
AGI paradigms hoping to cache in (yuk yuk) on these hyperincreases in FLOPS 
need to be geared to high arithmetic intensity.
 
Interestingly (to me), these two things both imply to me that we get to 
increase the complexity of neuron and synapse models beyond the muladd/synapse 
+ simple activation function model with essentially no degradation in 
performance since the bandwidth of propagating values between neurons is the 
bottleneck much more than local processing inside the neuron model.
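
The quick arithmetic behind the 0.03 bit/flops figure in item 1 (C870 specs
as discussed earlier in this thread):

    flops = 350e9                    # Tesla C870: ~350 GFLOPS
    bits = 1.5 * 8e9                 # 1.5 GB of on-board memory, in bits
    print(bits / flops)              # ~0.034 bit/flops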
 




RE: [agi] IBM, Los Alamos scientists claim fastest computer

2008-06-12 Thread Matt Mahoney
I think the ratio of processing power to memory to bandwidth is just about 
right for AGI. Processing power and memory increase at about the same rate 
under Moore's Law. The time it takes a modern computer to clear all of its 
memory is on the same order as the response time of a neuron, and this has not 
changed much since ENIAC and the Commodore 64. It would seem easier to increase 
processing density than memory density but we are constrained by power 
consumption, heat dissipation, network bandwidth, and the lack of software and 
algorithms for parallel computation.

Bandwidth is about right too. A modern PC can simulate about 1 mm^3 of brain 
tissue with 10^9 synapses at 0.1 ms resolution or so. Nerve fibers have a 
diameter around 1 or 2 microns, so a 1 mm cube would have about 10^6 of these 
transmitting 10 bits per second, or 10 Mb/s. Similar calculations for larger 
cubes show locality with bandwidth growing at O(n^2/3). This could be handled 
by an Ethernet cluster with a high speed core using off the shelf hardware.
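
The same estimate in a few lines of Python (assuming roughly one fiber per
square micron of cube face, per the 1-2 micron diameter figure above):

    # Fibers crossing a 1 mm cube face, assuming ~1 square micron per fiber.
    fibers = (1000 * 1000) / 1.0           # ~1e6
    print(fibers * 10 / 1e6, "Mb/s")       # at ~10 bits/s per fiber -> ~10 Mb/s

    # External bandwidth scales with surface area, i.e. O(n^(2/3)) in volume n.
    for side_mm in (1, 2, 4, 8):
        print(side_mm, "mm cube:", side_mm**3, "x volume,", side_mm**2, "x external bandwidth")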

I don't know if it is a coincidence that these 3 technologies are in the right 
ratio, or if it is driven by the needs of software that complements the human 
mind.

-- Matt Mahoney, [EMAIL PROTECTED]

--- On Thu, 6/12/08, Derek Zahn [EMAIL PROTECTED] wrote:
From: Derek Zahn [EMAIL PROTECTED]
Subject: RE: [agi] IBM, Los Alamos scientists claim fastest computer
To: agi@v2.listbox.com
Date: Thursday, June 12, 2008, 11:36 AM

Two things I think are interesting about these trends in high-performance 
commodity hardware:

1) The flops/bit ratio (processing power vs memory) is skyrocketing.  The 
move to parallel architectures makes the number of high-level operations per 
transistor go up, but bits of memory per transistor in large memory circuits 
doesn't go up.  The old bit per op/s or byte per op/s rules of thumb get 
really broken on things like Tesla (0.03 bit/flops).  Of course we don't know 
the ratio needed for de novo AGI or brain modeling, but the assumptions about 
processing vs memory certainly seem to be changing.

2) Much more than previously, effective utilization of processor operations 
requires incredibly high locality (processing cores only have immediate access 
to very small memories).  This is also referred to as arithmetic intensity.  
This of course is because parallelism causes operations per second to expand 
much faster than methods for increasing memory bandwidth to large banks.  
Perhaps future 3D layering techniques will help with this problem, but for now 
AGI paradigms hoping to cache in (yuk yuk) on these hyperincreases in FLOPS 
need to be geared to high arithmetic intensity.

Interestingly (to me), these two things both imply to me that we get to 
increase the complexity of neuron and synapse models beyond the muladd/synapse 
+ simple activation function model with essentially no degradation in 
performance since the bandwidth of propagating values between neurons is the 
bottleneck much more than local processing inside the neuron model.






Re: [agi] IBM, Los Alamos scientists claim fastest computer

2008-06-12 Thread Matt Mahoney
--- On Thu, 6/12/08, Mike Tintner [EMAIL PROTECTED] wrote:

 Matt: I think the ratio of processing power to memory to bandwidth is just
 about right for AGI.

 All these calculations (wh. are v. interesting) presume that all computing
 is done in the brain. They ignore the possibility (well, certainty) of
 morphological computing being done elsewhere in the system.  Do you take
 any interest in morphological computing?

I assume you mean the implicit computation done by our sensory organs and 
muscles. Yes, but I don't think that has a big effect on my estimates.

-- Matt Mahoney, [EMAIL PROTECTED]





RE: [agi] IBM, Los Alamos scientists claim fastest computer

2008-06-12 Thread Ed Porter
I think processor to memory, and inter processor communications are
currently far short




-Original Message-
From: Matt Mahoney [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 12, 2008 12:33 PM
To: agi@v2.listbox.com
Subject: RE: [agi] IBM, Los Alamos scientists claim fastest computer

 Matt Mahoney ## 
I think the ratio of processing power to memory to bandwidth is just about
right for AGI. 

Ed Porter ## 
I tend to think otherwise. I think the current processor-to-RAM and
processor-to-processor bandwidths are too low.  

(PLEASE CORRECT ME IF YOU THINK ANY OF MY BELOW CALCULATIONS OR STATEMENTS
ARE INCORRECT)

The average synapse fires more than once per second. The brain has roughly
10^12 - 10^15 synapses (the lower figure is based on some people's claim
that only 1% of synapses are really effective).  Since each synapse
activation involves two or more memory accesses (at least a
read-modify-write), that implies roughly a similar number of memory accesses
per second.  Because of the high degree of irregularity and non-locality of
connections in the brain, many of such accesses would have to be modeled by
non-sequential RAM accesses.  Since --- as is stated below in more detail ---
a current processor can only average roughly 10^7 non-sequential
read-modify-writes per second, that means 10^5 - 10^8 processors would be
required just to access RAM at the same rate the brain accesses memory at
its synapses, with 10^5 probably being a low number.
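
Spelled out as a few lines of Python (figures exactly as above):

    # 1e12-1e15 synapse events/s, each needing at least a (mostly non-sequential)
    # read-modify-write, against ~1e7 such accesses/s per current processor.
    for events_per_s in (1e12, 1e15):
        print(f"{events_per_s:.0e} events/s -> ~{events_per_s / 1e7:.0e} processors")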

But a significant number of the equivalent of synapse activations would
require inter-processor communication in an AGI made out of current computer
hardware. If one has only on the order of 10^5 processors, load balancing
becomes an issue, and to minimize it you actually want a fair amount of
non-locality of memory. (For example, when they put Shastri's Shruti
cognitive architecture on a Thinking Machine, they purposely randomized the
distribution of data across the machine's memory to promote load balancing.)
(Load balancing is not an issue in the brain, since the brain has the
equivalent of a simple, but parallel, processor for roughly every 100 to 10K
synapses.)  Thus, you are probably talking in terms of needing to send
something in the rough ballpark of 10^9 to 10^12 short inter-processor
messages a second.  To do this without congestion problems, you are probably
going to need a theoretical bandwidth 5 to 10 times that.

One piece of hardware that would be a great machine to run test AGIs on is
the roughly $60M TACC Ranger supercomputer in Austin, TX.  It includes
15,700 AMD quad-cores, for over 63K cores, and about 100 TB of RAM.  Most
importantly, it has Sun's very powerful Constellation system switch with
3456 (an easy-to-remember number) 20 Gbit bi-directional InfiniBand ports,
which gives a theoretical cross-sectional bandwidth of roughly 6.9 TByte/sec.
If the average spreading-activation message were 32 bytes, if such messages
were packed into larger blocks to reduce per-message costs, and if you
assumed only roughly 10 percent of the total capacity was used on average to
prevent congestion, that would allow roughly 20 billion global messages a
second, with each of the 3456 (roughly quad-processor, quad-core) nodes
receiving about 5 million per second.
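
The same interconnect arithmetic in Python (message size, utilization, and
port count as assumed above):

    cross_section_Bps = 6.9e12          # ~6.9 TByte/s theoretical cross-sectional bandwidth
    msg_bytes = 32                      # assumed spreading-activation message size
    utilization = 0.10                  # keep to ~10% to avoid congestion
    msgs_per_s = cross_section_Bps * utilization / msg_bytes
    print(f"{msgs_per_s:.1e} messages/s, {msgs_per_s / 3456:.1e} per node")
    # ~2.2e10 total (~20 billion) and ~6e6 per node, so the ~5 million above is conservative.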

(If anybody has any info on how many random memory accesses a quad-processor,
quad-core node can do per second, I would be very interested --- I am
guessing between 80 and 320 million/sec.)

I would not be surprised if the Ranger's inter-processor and
processor-to-RAM bandwidth is one or two orders of magnitude too low for
many types of human level thinking, but it would certainly be enough to do
very valuable AGI research, and to build powerful intelligences that would
be in many ways more powerful than human.
   
 Matt Mahoney ##
Processing power and memory increase at about the same rate under Moore's
Law. 

Ed Porter ## 
Yes, but the frequency of non-sequential processor-to-memory accesses has
increased much more slowly. (This may change in the future with the
development of the type of massively multi-core chips that Sam Adams says he
is now working on: chips with built-in high-bandwidth mesh networks and,
say, 10 RAM layers over each processor, the layers connected by
through-silicon vias.  Hopefully each such multi-layer chip would be
connected by hundreds of high-bandwidth communication channels, which could
help change this.  So also could processor-in-memory chips.)




 Matt Mahoney ## 
The time it takes a modern computer to clear all of its memory is on the
same order as the response time of a neuron, and this has not changed much
since ENIAC and the Commodore 64. It would seem easier to increase processing
density than memory density, but we are constrained by power consumption,
heat dissipation, network bandwidth, and the lack of software and algorithms
for parallel computation.

Re: [agi] IBM, Los Alamos scientists claim fastest computer

2008-06-12 Thread Kingma, D.P.
As far as I know, GPUs are not especially well suited to neural net
calculation. For some applications, speedup factors are in the 1000x range,
but for NNs I have only seen speedups of one order of magnitude (10x).

For example, see attached paper

On Thu, Jun 12, 2008 at 4:59 PM, Matt Mahoney [EMAIL PROTECTED] wrote:

 --- On Wed, 6/11/08, J Storrs Hall, PhD [EMAIL PROTECTED] wrote:

  Hmmph.  I offer to build anyone who wants one a
  human-capacity machine for
  $100K, using currently available stock parts, in one rack.
  Approx 10  teraflops, using Teslas.
  (http://www.nvidia.com/object/tesla_c870.html)
 
  The software needs a little work...

 Um, that's 10 petaflops, not 10 teraflops. I'm assuming a neural network
 with 10^15 synapses (about 1 or 2 byte each) with 20 to 100 ms resolution,
 10^16 to 10^17 operations per second.  One Tesla = 350 GFLOPS, 1.5 GB, 120W,
 $1.3K.  So maybe $1 billion and 100 MW of power for a few hundred thousand
 of these plus glue.


 -- Matt Mahoney, [EMAIL PROTECTED]











RE: [agi] IBM, Los Alamos scientists claim fastest computer

2008-06-12 Thread Matt Mahoney
--- On Thu, 6/12/08, Ed Porter [EMAIL PROTECTED] wrote:

 I think processor to memory, and inter processor
 communications are currently far short

Your concern is over the added cost of implementing a sparsely connected 
network, which slows memory access and requires more memory for representation 
(e.g. pointers in addition to a weight matrix). We can alleviate much of the 
problem by using connection locality.

The brain has about 10^11 neurons with 10^4 synapses per neuron. If we divide 
this work among 10^6 processors, each representing 1 mm^3 of brain tissue, then 
each processor must implement 10^5 neurons and 10^9 synapses. By my earlier 
argument, there can be at most 10^6 external connections assuming 1-2 micron 
nerve fiber diameter, so half of the connections must be local. This is true at 
any scale because when you double the size of a cube, you increase the number 
of neurons by 8 but increase the number of external connections by 4. Thus, for 
any size cube, half of the external connections are to neighboring cubes and 
half are to more distant cubes.

A 1 mm^3 cube can be implemented as a fully connected 10^5 by 10^5 matrix of 
10^10 connections. This could be implemented as a 1.25 GB array of bits with 5% 
of bits set to 1 representing a connection. The internal computation bottleneck 
is the vector product which would be implemented using 128 bit AND instructions 
in SSE2 at full serial memory bandwidth. External communication is at most one 
bit per connected neuron every cycle (20-100 ms), because the connectivity 
graph does not change rapidly. A randomly connected sparse network could be 
described compactly using hash functions.
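
A small Python/NumPy sketch of the ideas in that paragraph: the 1.25 GB size
check, a packed-bit AND-and-count standing in for the SSE2 inner loop, and a
hash-defined random sparse connectivity (the sizes and hash constants are
illustrative, not from the original post):

    import numpy as np

    # Size check: a fully connected 10^5 x 10^5 bit matrix.
    print(10**5 * 10**5 / 8 / 1e9, "GB")          # 1.25 GB

    # Packed-bit AND-and-count at a small size (SSE2 would do this 128 bits at a time).
    rng = np.random.default_rng(0)
    n = 1024
    conn = rng.random((n, n)) < 0.05              # ~5% of bits set = a connection
    fired = rng.random(n) < 0.1                   # presynaptic neurons that fired this cycle
    fired_packed = np.packbits(fired)

    j = 7                                         # one postsynaptic neuron, for illustration
    col_packed = np.packbits(conn[:, j])          # its input connections as a packed bit vector
    active_inputs = np.unpackbits(col_packed & fired_packed).sum()

    # Hash-defined random sparse connectivity: the graph itself needs no storage.
    def connected(i, j, density=0.05):
        h = (i * 2654435761 + j * 40503) % 2**32  # simple integer hash (illustrative constants)
        return h < density * 2**32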

Also, there are probably more efficient implementations of AGI than modeling 
the brain because we are not constrained to use slow neurons. For example, low 
level visual feature detection could be implemented serially by sliding a 
coefficient window over a 2-D image rather than by maintaining sets of 
identical weights for each different region of the image like the brain does. I 
don't think we really need 10^15 bits to implement the 10^9 bits of long term 
memory that Landauer says we have.
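
A minimal sketch of that serial sliding-window idea in NumPy (toy image and a
single hand-picked edge kernel, purely for illustration):

    import numpy as np

    # One 3x3 edge kernel swept serially over an image, instead of duplicating
    # the same weights at every retinal location the way the brain does.
    rng = np.random.default_rng(0)
    image = rng.random((64, 64)).astype(np.float32)
    kernel = np.array([[-1, 0, 1],
                       [-2, 0, 2],
                       [-1, 0, 1]], dtype=np.float32)   # vertical-edge detector

    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1), dtype=np.float32)
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)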

-- Matt Mahoney, [EMAIL PROTECTED]





Re: [agi] IBM, Los Alamos scientists claim fastest computer

2008-06-11 Thread J Storrs Hall, PhD
Hmmph.  I offer to build anyone who wants one a human-capacity machine for 
$100K, using currently available stock parts, in one rack. Approx 10 
teraflops, using Teslas. (http://www.nvidia.com/object/tesla_c870.html)

The software needs a little work...

Josh


On Wednesday 11 June 2008 08:50:58 pm, Matt Mahoney wrote:
 http://www.chron.com/disp/story.mpl/business/5826863.html
 
 World's fastest computer at 1 petaflop and 80 TB memory. Cost US $100 
million.  Claims 1 watt per 376 million calculations, which comes to 2.6 
megawatts if my calculations are correct.
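
Checking that quoted arithmetic:

    # 1 petaflop at 1 watt per 376 million calculations per second
    print(1e15 / 376e6 / 1e6, "MW")   # ~2.66 MW, consistent with the ~2.6 MW above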
 
 So with about 10 of these, I think we should be on our way to simulating a 
human brain sized neural network.
 
 -- Matt Mahoney, [EMAIL PROTECTED]
 
 

