RE: [agi] IBM, Los Alamos scientists claim fastest computer
Matt,

As acknowledged before, your notion of using a single matrix to represent a neural net with, say, 10^5 neurons, each with up to 10^4 connections, has the benefit of allowing much more efficient processing: the rows could be read serially, which allows much faster memory access; the output from processing along each row that is contributed to column values could be stored largely in cache, allowing relatively rapid storage even if the representations of contacts are relatively sparse along each row; and the equivalent of address representations of contacts could be represented much more efficiently. I had known neural nets were often represented by matrices, but it is still not totally clear to me that it would work well with the type of system I am interested in. But it might.

-----Original Message-----
From: Matt Mahoney [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 13, 2008 2:30 PM
To: agi@v2.listbox.com
Subject: RE: [agi] IBM, Los Alamos scientists claim fastest computer

With regard to representing different types of synapses (various time delays, strength bounds, learning rates, etc.), this information can be recorded as characteristics of the input and output neurons and derived as needed to save space.

[Ed Porter] I still think you are going to need multi-bit weights at each row-column element in the matrix -- since almost all representations of synapses I have seen have assumed a weight having at least 6 bits of information, and there is reason to think you need to store both a short-term and a long-term value, since it seems to me that is necessary for the temporal correlation component of Hebbian learning, and to represent the state information that is short-term memory.

[Ed Porter] I don't understand how this information could be stored or accessed more efficiently than in row-column matrix elements, so that it would be available when computing a row's contribution to each successive column.
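A minimal sketch of the row-serial update being discussed, with illustrative sizes and names that are not from the thread: each row of the weight matrix holds one neuron's outgoing connections, so propagation scans rows sequentially (fast, cache-friendly reads) while contributions accumulate in a single column-activation buffer.

```python
import numpy as np

def propagate(weights, active):
    """One propagation step over a dense weight matrix.

    weights[i, j] = strength of the synapse from neuron i to neuron j.
    active       = boolean vector of currently firing neurons.
    Rows are read serially; each firing neuron's row is added into a
    small column-activation buffer that can live in cache.
    """
    activation = np.zeros(weights.shape[1])
    for i in np.flatnonzero(active):   # only firing neurons contribute
        activation += weights[i]       # serial read of one row
    return activation

# Tiny example: 4 neurons, neurons 0 and 2 firing.
w = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.],
              [1., 0., 0., 0.]])
fired = np.array([True, False, True, False])
print(propagate(w, fired))  # -> [0. 1. 0. 1.]
```

The same loop structure works for a sparse or bit-packed row format; only the inner `weights[i]` read changes.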
This would include any information such as whether excitatory or inhibitory, delays, strength bounds, and learning rates, unless you assume such information is uniform in each row or each column. (In the brain, this may be the case for some attributes -- such as, I think, having separate neurons for inhibition and excitation -- but it would appear to require added complexity for many learning tasks.)

Minimizing inter-processor communication is a harder problem. This can be done by mapping the neural network into a hierarchical organization so that groups of co-located neurons are forced to communicate with other groups through narrow channels using a small number of neurons. We know that many problems can be solved this way. For example, a semantic language model made of a 20K by 20K word association matrix can be represented using singular value decomposition as a 3-layer neural network with about 100 to 200 hidden neurons [1,2]. The two weight matrices could then be implemented on separate processors which communicate through the hidden layer neurons.

[Ed Porter] Having matrices for connecting the matrices makes sense. But my understanding is that SVD is often a form of lossy compression.

[Ed Porter] As I understand it, it is the non-square equivalent of finding the largest eigenvectors of a matrix, and it involves discarding the information contained in the less important (smaller singular value) dimensions. I assume an AGI could afford a certain amount of such error due to loss of detail for tasks similar to the 20K word example, where the goal was to provide a simplified representation of bi-grams, which are just statistical correlations in ordered word pairs; but it might not be appropriate for representing episodic memory, where more crisp, less statistical info is of importance.
[Ed Porter] Also, it would seem that dynamically learning new patterns would present problems in a system in which the interconnects between matrices have been determined by such an SVD process. Perhaps there would be a separate system for learning new information, and for information that represents important exceptions to the SVD compression.

More generally, we know from chaos theory that complex systems must limit the number of interconnections to be stable [3], which suggests that many AI problems in general can be decomposed this way. Remember we need not model the human brain in precise detail, since our goal is to solve AGI by any means. We are allowed to use more efficient algorithms if we discover them.

I ran some benchmarks on my PC (2.2 GHz Athlon-64 3500+). It copies large arrays at 1 GB per second using MMX or SSE2, which is not quite fast enough for a 10^5 by 10^5 neural network simulation.

[Ed Porter] I assume this is on an uncompressed matrix. I don't know what the overhead would be to compress the matrix, such as by representing all the elements that are empty with run-length encoding, and then trying to process it. Presumably to use MMX or SSE2 you would have to load some of the compressed matrix into cache, decompress it, and then run it as a block through the MMX or SSE2 units.
RE: [agi] IBM, Los Alamos scientists claim fastest computer
--- On Sat, 6/14/08, Ed Porter [EMAIL PROTECTED] wrote:

[Ed Porter] I still think you are going to need multi-bit weights at each row-column element in the matrix -- since almost all representations of synapses I have seen have assumed a weight having at least 6 bits of information, and there is reason to think you need to store both a short-term and a long-term value, since it seems to me that is necessary for the temporal correlation component of Hebbian learning, and to represent the state information that is short-term memory.

There is a tradeoff between using a larger number of neurons (redundant representation of concepts) and a more precise representation of synaptic weights. There is also a tradeoff between representing the exact properties of synapses and approximating them from the properties of the neurons they connect.

[Ed Porter] Having matrices for connecting the matrices makes sense. But my understanding is that SVD is often a form of lossy compression.

That is true. When used to compress semantic relationships, it implements the transitive property. For example, if a word-word matrix learns the relationships rain-wet and wet-water, SVD will infer rain-water even if it was not seen in the training corpus. SVD (or equivalently, a 3-layer neural network) could also be used to compress a mapping of pixels to characters for OCR. The hidden neurons (or largest singular values) would represent intermediate features like line segments.

I ran some benchmarks on my PC (2.2 GHz Athlon-64 3500+). It copies large arrays at 1 GB per second using MMX or SSE2, which is not quite fast enough for a 10^5 by 10^5 neural network simulation.

[Ed Porter] I assume this is on an uncompressed matrix. I don't know what the overhead would be to compress the matrix, such as by representing all the elements that are empty with run-length encoding, and then trying to process it.
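The rain-wet / wet-water example can be sketched concretely. Below is a toy co-occurrence matrix with made-up counts (my illustration, not from the thread): "rain-water" was never observed, yet a rank-1 SVD truncation -- the same kind of lossy compression as keeping ~100-200 hidden neurons out of 20K words -- assigns it a positive value transitively.

```python
import numpy as np

# Toy word-word co-occurrence counts for (rain, wet, water).
# rain-wet and wet-water were seen together; rain-water never was.
# The small diagonal (self-counts) just makes the top singular value unique.
M = np.array([[1., 2., 0.],
              [2., 1., 1.],
              [0., 1., 1.]])

U, s, Vt = np.linalg.svd(M)

# Keep only the largest singular value: lossy rank-1 compression.
M1 = s[0] * np.outer(U[:, 0], Vt[0])

print("original rain-water:", M[0, 2])   # 0.0 -- never observed
print("rank-1   rain-water:", M1[0, 2])  # > 0 -- inferred transitively
```

The less important dimensions that the truncation discards are exactly where exceptions to the transitive inference would live, which is the concern raised above about episodic memory.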
Presumably to use MMX or SSE2 you would have to load some of the compressed matrix into cache, decompress it, and then run it as a block through the MMX or SSE2 units.

It is probably not worth compressing a matrix with 10% density because of the time needed to decompress it. Decompressing runs of zeros is inefficient on a vector processor. SSE2 has prefetch instructions so that matrix elements can already be in cache when they are needed. However, modern processors usually detect sequential memory access and prefetch automatically.

-- Matt Mahoney, [EMAIL PROTECTED]

---
agi
Archives: http://www.listbox.com/member/archive/303/=now
RSS Feed: http://www.listbox.com/member/archive/rss/303/
Modify Your Subscription: http://www.listbox.com/member/?member_id=8660244id_secret=106510220-47b225
Powered by Listbox: http://www.listbox.com
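The claim that run-length encoding doesn't pay at 10% density can be checked with a quick sketch (illustrative helper names, not from the thread). At that density the expected run length is short, so the run list is actually *larger* than the bit-packed raw row:

```python
import numpy as np

def rle_encode(bits):
    """Run-length encode a 0/1 row as (value, run_length) pairs."""
    runs, i = [], 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        runs.append((int(bits[i]), j - i))
        i = j
    return runs

def rle_decode(runs):
    return np.concatenate([np.full(n, v, dtype=np.uint8) for v, n in runs])

rng = np.random.default_rng(0)
row = (rng.random(10_000) < 0.10).astype(np.uint8)  # ~10% density

runs = rle_encode(row)
assert np.array_equal(rle_decode(runs), row)        # lossless round trip

# ~18% of positions start a new run at 10% density, so the run list
# outweighs the 1250 bytes of bit-packed raw data -- Matt's point.
print(len(row) // 8, "bytes raw (bit-packed) vs ~", 2 * len(runs), "bytes of runs")
```

RLE only wins when density is far below 10%, or when 1-bits cluster into long runs.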
RE: [agi] IBM, Los Alamos scientists claim fastest computer
Matt,

Thank you for your reply. For me it is very thought-provoking.

-----Original Message-----
From: Matt Mahoney [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 12, 2008 7:23 PM
To: agi@v2.listbox.com
Subject: RE: [agi] IBM, Los Alamos scientists claim fastest computer

--- On Thu, 6/12/08, Ed Porter [EMAIL PROTECTED] wrote:

I think processor-to-memory, and inter-processor communications are currently far short

Your concern is over the added cost of implementing a sparsely connected network, which slows memory access and requires more memory for representation (e.g. pointers in addition to a weight matrix). We can alleviate much of the problem by using connection locality.

[Ed Porter] This would certainly be true if it worked.

The brain has about 10^11 neurons with 10^4 synapses per neuron. If we divide this work among 10^6 processors, each representing 1 mm^3 of brain tissue, then each processor must implement 10^5 neurons and 10^9 synapses. By my earlier argument, there can be at most 10^6 external connections assuming 1-2 micron nerve fiber diameter,

[Ed Porter] -- Why couldn't each of the 10^6 fibers have multiple connections along its length within the cm^3 (although it could be represented as one row in the matrix, with individual connections represented as elements in such a row)?

so half of the connections must be local. This is true at any scale, because when you double the size of a cube, you increase the number of neurons by 8 but increase the number of external connections by 4. Thus, for any size cube, half of the external connections are to neighboring cubes and half are to more distant cubes.

[Ed Porter] -- I am getting lost here. Why are half the connections local? You implied there are 10^6 external connections in the cm^3, and 10^9 synapses, which are the connections. Thus the 10^6 external connections you mention are only 1/1000 of the 10^9 total connections you mention in the cm^3, not one half as you say.
I understand that there are likely to be as many connections leaving the cube as going into it, which is related to, but not the same thing as, saying half the connections in the cm^3 are external.

[Ed Porter] -- It is true that the rate of change of the surface-to-volume ratio for each doubling in scale remains 1/2, but the actual ratio of surface to volume decreases by 1/2 at each such doubling of scale, meaning the ratio actually DOES CHANGE with scaling, rather than remaining constant, as indicated above.

A 1 mm^3 cube can be implemented as a fully connected 10^5 by 10^5 matrix of 10^10 connections. This could be implemented as a 1.25 GB array of bits with 5% of bits set to 1 representing a connection.

[Ed Porter] A synapse would have multiple weights, such as short-term and long-term weights, and they would each be more than one bit. Plus some are excitatory and others are inhibitory, so they would have differing signs. So multiple bits, probably at least two bytes, would be necessary per element in the matrix.

[Ed Porter] -- Also you haven't explained how you efficiently do the activation between cubes (I assume it would be by having a row for each neuron that projects axons into the cube, and a column for each neuron that projects a dendrite into it). This could still be represented by the matrix, but it would tend to increase its sparseness.

[Ed Porter] -- Learning changes in which dendrites and axons project into a cube would require changing the matrix, which is doable, but can make things more complicated. Another issue is how many other cubes each cm^3 would communicate with. Are we talking 10, 100, 10^3, 10^4, 10^5, or 10^6? The number could have a significant impact on communication costs.

[Ed Porter] -- I don't think this system would be good for my current model for AGI representation, which is based on a type of graph matching, rather than just a simple summing of synaptic inputs.
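The geometric point under dispute can be made explicit with the thread's own figures (the per-mm numbers below are the ones quoted above; the function name is mine). Synapses fill the cube's volume while external fibers must cross its surface, so the external-to-total ratio is about 1/1000 at 1 mm and halves with each doubling of the side:

```python
def external_ratio(L_mm):
    """External fibers per synapse for an L_mm-sided cube of cortex,
    using the thread's figures: ~10^9 synapses per mm^3 of volume,
    ~10^6 external fibers for a 1 mm cube (surface-area scaling)."""
    synapses = 1e9 * L_mm ** 3   # grows with volume  (x8 per doubling)
    external = 1e6 * L_mm ** 2   # grows with surface (x4 per doubling)
    return external / synapses

for L in (1, 2, 4, 8):
    print(f"L={L} mm: external fibers / synapses = {external_ratio(L):.1e}")
```

This supports both sides: external fibers are a tiny fraction of synapses (Ed's 1/1000), while the *surface* argument only says that of the external fibers, those crossing to adjacent cubes versus distant ones split roughly evenly at every scale.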
The internal computation bottleneck is the vector product, which would be implemented using 128-bit AND instructions in SSE2 at full serial memory bandwidth. External communication is at most one bit per connected neuron every cycle (20-100 ms), because the connectivity graph does not change rapidly. A randomly connected sparse network could be described compactly using hash functions.

[Ed Porter] -- It is interesting to think that this actually could be used to speed up the processing of simple neural models. I understand how the row values associated with the axon synapses of a given neuron could be read rapidly in a serial manner, and how run-length encoding, or some other means, could be used to more compactly represent a sparse matrix. I also understand how the contributions to the activation of each of the 10^5 columns made by each row could be stored in L2 cache at a rate of about 100 MHz.

[Ed Porter] -- L2 cache writes commonly take about 10 to 20 clock cycles --- perhaps you could write them into memory blocks in L1 cache, which might only take about two clock cycles.
RE: [agi] IBM, Los Alamos scientists claim fastest computer
--- On Fri, 6/13/08, Ed Porter [EMAIL PROTECTED] wrote:

[Ed Porter] -- Why couldn't each of the 10^6 fibers have multiple connections along its length within the cm^3 (although it could be represented as one row in the matrix, with individual connections represented as elements in such a row)?

I think you mean 10^6 fibers in 1 cubic millimeter. They would have multiple connections, but I am only counting interprocessor communication, which is 1 bit to transmit the state of the neuron (on or off), or a few bits to transmit its activation level, to neighboring processors.

With regard to representing different types of synapses (various time delays, strength bounds, learning rates, etc.), this information can be recorded as characteristics of the input and output neurons and derived as needed to save space.

Minimizing inter-processor communication is a harder problem. This can be done by mapping the neural network into a hierarchical organization so that groups of co-located neurons are forced to communicate with other groups through narrow channels using a small number of neurons. We know that many problems can be solved this way. For example, a semantic language model made of a 20K by 20K word association matrix can be represented using singular value decomposition as a 3-layer neural network with about 100 to 200 hidden neurons [1,2]. The two weight matrices could then be implemented on separate processors which communicate through the hidden layer neurons.

More generally, we know from chaos theory that complex systems must limit the number of interconnections to be stable [3], which suggests that many AI problems in general can be decomposed this way. Remember we need not model the human brain in precise detail, since our goal is to solve AGI by any means. We are allowed to use more efficient algorithms if we discover them.

I ran some benchmarks on my PC (2.2 GHz Athlon-64 3500+).
It copies large arrays at 1 GB per second using MMX or SSE2, which is not quite fast enough for a 10^5 by 10^5 neural network simulation.

1. Bellegarda, Jerome R., John W. Butzberger, Yen-Lu Chow, Noah B. Coccaro, Devang Naik (1996), “A novel word clustering algorithm based on latent semantic analysis”, Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, 172-175.

2. Gorrell, Genevieve (2006), “Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing”, Proceedings of EACL 2006, Trento, Italy. http://www.aclweb.org/anthology-new/E/E06/E06-1013.pdf

3. Kauffman, Stuart A. (1991), “Antichaos and Adaptation”, Scientific American, Aug. 1991, p. 64.

-- Matt Mahoney, [EMAIL PROTECTED]
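A rough analogue of the copy benchmark mentioned above can be run from Python (the buffer size and repetition count are arbitrary choices of mine; absolute numbers are machine-dependent, and NumPy's copy is not the same code path as hand-written MMX/SSE2 -- the point is the method, not the 1 GB/s figure):

```python
import time
import numpy as np

src = np.ones(64_000_000, dtype=np.uint8)   # 64 MB source buffer
dst = np.empty_like(src)                    # destination buffer

reps = 10
t0 = time.perf_counter()
for _ in range(reps):
    np.copyto(dst, src)                     # sequential bulk copy
elapsed = time.perf_counter() - t0

gb_copied = reps * src.nbytes / 1e9
print(f"{gb_copied / elapsed:.1f} GB/s sequential copy bandwidth")
```

Comparing this sequential figure against a random-access variant (e.g. scattered single-element writes) is what exposes the one-to-two-order-of-magnitude gap discussed later in the thread.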
Re: [agi] IBM, Los Alamos scientists claim fastest computer
If anyone is interested, I have some additional information on the C870 NVIDIA Tesla card. I'll be happy to send it to you off-list. Just contact me directly.

Cheers,
Brad
Re: [agi] IBM, Los Alamos scientists claim fastest computer
--- On Wed, 6/11/08, J Storrs Hall, PhD [EMAIL PROTECTED] wrote:

Hmmph. I offer to build anyone who wants one a human-capacity machine for $100K, using currently available stock parts, in one rack. Approx 10 teraflops, using Teslas. (http://www.nvidia.com/object/tesla_c870.html) The software needs a little work...

Um, that's 10 petaflops, not 10 teraflops. I'm assuming a neural network with 10^15 synapses (about 1 or 2 bytes each) with 20 to 100 ms resolution, i.e. 10^16 to 10^17 operations per second. One Tesla = 350 GFLOPS, 1.5 GB, 120W, $1.3K. So maybe $1 billion and 100 MW of power for a few hundred thousand of these plus glue.

-- Matt Mahoney, [EMAIL PROTECTED]
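The arithmetic behind this correction, using only the figures in the message (assuming one operation per synapse per cycle; the 10^17 upper bound presumably allows a few operations per synapse, and the $1B / 100 MW totals are round-ups that include glue hardware):

```python
synapses = 1e15          # assumed network size
cycle_s = (0.02, 0.1)    # 20-100 ms update resolution

# one op per synapse per cycle -> 10^16 .. 5*10^16 ops/s
ops_per_sec = [synapses / c for c in cycle_s]
print(f"required: {min(ops_per_sec):.0e} .. {max(ops_per_sec):.0e} ops/s")

tesla_flops, tesla_watts, tesla_cost = 350e9, 120, 1300
cards = [ops / tesla_flops for ops in ops_per_sec]
print(f"Teslas:   {min(cards):,.0f} .. {max(cards):,.0f}")
print(f"power:    {min(cards)*tesla_watts/1e6:.0f} .. {max(cards)*tesla_watts/1e6:.0f} MW (cards only)")
print(f"cost:     ${min(cards)*tesla_cost/1e6:.0f}M .. ${max(cards)*tesla_cost/1e6:.0f}M (cards only)")
```

About 30,000 to 140,000 cards on these assumptions, which rounds to "a few hundred thousand" once redundancy and interconnect overhead are included.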
Re: [agi] IBM, Los Alamos scientists claim fastest computer
Right. You're talking Kurzweil HEPP and I'm talking Moravec HEPP (and shading that a little). I may want your gadget when I go to upload, though.

Josh

On Thursday 12 June 2008 10:59:51 am, Matt Mahoney wrote:

--- On Wed, 6/11/08, J Storrs Hall, PhD [EMAIL PROTECTED] wrote:

Hmmph. I offer to build anyone who wants one a human-capacity machine for $100K, using currently available stock parts, in one rack. Approx 10 teraflops, using Teslas. (http://www.nvidia.com/object/tesla_c870.html) The software needs a little work...

Um, that's 10 petaflops, not 10 teraflops. I'm assuming a neural network with 10^15 synapses (about 1 or 2 bytes each) with 20 to 100 ms resolution, i.e. 10^16 to 10^17 operations per second. One Tesla = 350 GFLOPS, 1.5 GB, 120W, $1.3K. So maybe $1 billion and 100 MW of power for a few hundred thousand of these plus glue.

-- Matt Mahoney, [EMAIL PROTECTED]
RE: [agi] IBM, Los Alamos scientists claim fastest computer
Two things I think are interesting about these trends in high-performance commodity hardware:

1) The flops/bit ratio (processing power vs memory) is skyrocketing. The move to parallel architectures makes the number of high-level operations per transistor go up, but bits of memory per transistor in large memory circuits doesn't go up. The old bit-per-op/s or byte-per-op/s rules of thumb get really broken on things like Tesla (0.03 bit/flops). Of course we don't know the ratio needed for de novo AGI or brain modeling, but the assumptions about processing vs memory certainly seem to be changing.

2) Much more than previously, effective utilization of processor operations requires incredibly high locality (processing cores only have immediate access to very small memories). This is also referred to as arithmetic intensity. This of course is because parallelism causes operations per second to expand much faster than methods for increasing memory bandwidth to large banks. Perhaps future 3D layering techniques will help with this problem, but for now AGI paradigms hoping to cache in (yuk yuk) on these hyperincreases in FLOPS need to be geared to high arithmetic intensity.

Interestingly (to me), these two things both imply to me that we get to increase the complexity of neuron and synapse models beyond the muladd-per-synapse + simple activation function model with essentially no degradation in performance, since the bandwidth of propagating values between neurons is the bottleneck much more than local processing inside the neuron model.
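The 0.03 bit/flops figure for the Tesla C870 can be reconstructed directly from the specs quoted elsewhere in this thread (1.5 GB, ~350 GFLOPS):

```python
tesla_bytes = 1.5e9    # 1.5 GB on-board memory
tesla_flops = 350e9    # ~350 GFLOPS peak (figure from the thread)

ratio = tesla_bytes * 8 / tesla_flops
print(f"{ratio:.3f} bits of memory per flop/s")
# versus the old rule of thumb of ~1 bit (or even 1 byte) per op/s
```

That is roughly 0.034 bit/flops, i.e. thirty times less memory per unit of compute than the traditional 1 bit/op/s balance.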
RE: [agi] IBM, Los Alamos scientists claim fastest computer
I think the ratio of processing power to memory to bandwidth is just about right for AGI. Processing power and memory increase at about the same rate under Moore's Law. The time it takes a modern computer to clear all of its memory is on the same order as the response time of a neuron, and this has not changed much since ENIAC and the Commodore 64. It would seem easier to increase processing density than memory density, but we are constrained by power consumption, heat dissipation, network bandwidth, and the lack of software and algorithms for parallel computation.

Bandwidth is about right too. A modern PC can simulate about 1 mm^3 of brain tissue with 10^9 synapses at 0.1 ms resolution or so. Nerve fibers have a diameter around 1 or 2 microns, so a 1 mm cube would have about 10^6 of these transmitting 10 bits per second, or 10 Mb/s. Similar calculations for larger cubes show locality, with bandwidth growing at O(n^2/3). This could be handled by an Ethernet cluster with a high-speed core using off-the-shelf hardware.

I don't know if it is coincidence that these 3 technologies are in the right ratio, or if it is driven by the needs of software that complements the human mind.

-- Matt Mahoney, [EMAIL PROTECTED]

--- On Thu, 6/12/08, Derek Zahn [EMAIL PROTECTED] wrote:

From: Derek Zahn [EMAIL PROTECTED]
Subject: RE: [agi] IBM, Los Alamos scientists claim fastest computer
To: agi@v2.listbox.com
Date: Thursday, June 12, 2008, 11:36 AM

Two things I think are interesting about these trends in high-performance commodity hardware:

1) The flops/bit ratio (processing power vs memory) is skyrocketing. The move to parallel architectures makes the number of high-level operations per transistor go up, but bits of memory per transistor in large memory circuits doesn't go up. The old bit-per-op/s or byte-per-op/s rules of thumb get really broken on things like Tesla (0.03 bit/flops).
Of course we don't know the ratio needed for de novo AGI or brain modeling, but the assumptions about processing vs memory certainly seem to be changing.

2) Much more than previously, effective utilization of processor operations requires incredibly high locality (processing cores only have immediate access to very small memories). This is also referred to as arithmetic intensity. This of course is because parallelism causes operations per second to expand much faster than methods for increasing memory bandwidth to large banks. Perhaps future 3D layering techniques will help with this problem, but for now AGI paradigms hoping to cache in (yuk yuk) on these hyperincreases in FLOPS need to be geared to high arithmetic intensity.

Interestingly (to me), these two things both imply to me that we get to increase the complexity of neuron and synapse models beyond the muladd/synapse + simple activation function model with essentially no degradation in performance since the bandwidth of propagating values between neurons is the bottleneck much more than local processing inside the neuron model.
Re: [agi] IBM, Los Alamos scientists claim fastest computer
--- On Thu, 6/12/08, Mike Tintner [EMAIL PROTECTED] wrote:

Matt: I think the ratio of processing power to memory to bandwidth is just about right for AGI.

All these calculations (wh. are v. interesting) presume that all computing is done in the brain. They ignore the possibility (well, certainty) of morphological computing being done elsewhere in the system. Do you take any interest in morphological computing?

I assume you mean the implicit computation done by our sensory organs and muscles. Yes, but I don't think that has a big effect on my estimates.

-- Matt Mahoney, [EMAIL PROTECTED]
RE: [agi] IBM, Los Alamos scientists claim fastest computer
I think processor-to-memory, and inter-processor communications are currently far short

-----Original Message-----
From: Matt Mahoney [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 12, 2008 12:33 PM
To: agi@v2.listbox.com
Subject: RE: [agi] IBM, Los Alamos scientists claim fastest computer

Matt Mahoney ## I think the ratio of processing power to memory to bandwidth is just about right for AGI.

Ed Porter ## I tend to think otherwise. I think the current processor-to-RAM and processor-to-processor bandwidths are too low. (PLEASE CORRECT ME IF YOU THINK ANY OF MY BELOW CALCULATIONS OR STATEMENTS ARE INCORRECT)

The average synapse fires about once per second. The brain has roughly 10^12 - 10^15 synapses (the lower figure is based on some people's claim that only 1% of synapses are really effective). Since each synapse activation involves at least two memory accesses (at least a read-modify-write), that would involve roughly a similar number of memory accesses per second. Because of the high degree of irregularity and non-locality of connections in the brain, many of such accesses would have to be modeled by non-sequential RAM accesses. Since --- as is stated below in more detail --- a current processor can only average roughly about 10^7 non-sequential read-modify-writes per second, that means 10^5 - 10^8 processors would be required just to access RAM at the same rate the brain accesses memory at its synapses, with 10^5 probably being a low number.

But a significant number of the equivalent of synapse activations would require inter-processor communication in an AGI made out of current computer hardware. If one has only on the order of 10^5 processors, load balancing becomes an issue. And to minimize this you actually want a fair amount of non-locality of memory. (For example, when they put Shastri's Shruti cognitive architecture on a Thinking Machine, they purposely randomized the distribution of data across the machine's memory to promote load balancing.)
(Load balancing is not an issue in the brain, since the brain has the equivalent of a simple, but parallel, processor for roughly every 100 to 10K synapses.)

Thus, you are probably talking in terms of needing to be able to send something in the rough ballpark of 10^9 to 10^12 short inter-processor messages a second. To do this without having congestion problems, you are probably going to need a theoretical bandwidth 5 to 10 times that.

One piece of hardware that would be a great machine to run test AGIs on is the roughly $60M TACC Ranger supercomputer in Austin, TX. It includes 15,700 AMD quadcores, for over 63K cores, and about 100TB of RAM. Most importantly it has Sun's very powerful Constellation system switch with 3456 (an easy to remember number) 20Gbit bi-directional infiniband ports, which is a theoretical cross-sectional bandwidth of roughly 6.9TByte/sec.

If the average spreading-activation message were 32 bytes, if such messages were packed into larger blocks to reduce per-message costs, and if you assumed roughly only 10 percent of the total capacity was used on average to prevent congestion, that would allow roughly 20 billion global messages a second, with each of the 3456 roughly quad-core nodes receiving about 5 million per second. (If anybody has any info on how many random memory accesses a quad-processor quad-core node can do per second, I would be very interested --- I am guessing between 80 to 320 million/sec.)

I would not be surprised if the Ranger's inter-processor and processor-to-RAM bandwidth is one or two orders of magnitude too low for many types of human-level thinking, but it would certainly be enough to do very valuable AGI research, and to build powerful intelligences that would be in many ways more powerful than human.

Matt Mahoney ## Processing power and memory increase at about the same rate under Moore's Law.

Ed Porter ## Yes, but the frequency of non-sequential processor-to-memory accesses has increased much more slowly.
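The Ranger message-rate estimate above can be reproduced from the figures in the message (the 6.9 TB/s, 10% utilization, and 32-byte message size are the message's own assumptions, not verified specs):

```python
bisection_bw = 6.9e12    # ~6.9 TByte/s cross-sectional bandwidth (quoted)
utilization = 0.10       # assume only 10% usable to avoid congestion
msg_bytes = 32           # one short spreading-activation message
nodes = 3456             # one node per switch port

msgs_per_sec = bisection_bw * utilization / msg_bytes
print(f"{msgs_per_sec:.1e} global messages/s")            # ~2e10
print(f"{msgs_per_sec / nodes:.1e} messages/s per node")  # ~6e6
```

That is roughly 2 x 10^10 global messages per second, about 6 million per node, consistent with the "20 billion / about 5 million" figures above and still one to two orders of magnitude below the 10^12 upper estimate for brain-rate synapse traffic.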
(This may change in the future with the development of the type of massively multi-core chips Sam Adams says he is now working on: chips with built-in high-bandwidth mesh networks and, say, 10 RAM layers over each processor, with the layers of each such chip connected by through-silicon vias. Hopefully each such multi-layer chip would be connected with hundreds of high-bandwidth communication channels, which could help change this. So also could processor-in-memory chips.)

Matt Mahoney ## The time it takes a modern computer to clear all of its memory is on the same order as the response time of a neuron, and this has not changed much since ENIAC and the Commodore 64. It would seem easier to increase processing density than memory density but we are constrained by power consumption, heat dissipation, network bandwidth, and the lack of software and algorithms for parallel computation
Re: [agi] IBM, Los Alamos scientists claim fastest computer
As far as I know, GPUs are not very well suited to neural net calculation. For some applications, speedup factors come in the 1000 range, but for NNs I have only seen speedups of one order of magnitude (10x). For example, see the attached paper.

On Thu, Jun 12, 2008 at 4:59 PM, Matt Mahoney [EMAIL PROTECTED] wrote:

--- On Wed, 6/11/08, J Storrs Hall, PhD [EMAIL PROTECTED] wrote:

Hmmph. I offer to build anyone who wants one a human-capacity machine for $100K, using currently available stock parts, in one rack. Approx 10 teraflops, using Teslas. (http://www.nvidia.com/object/tesla_c870.html) The software needs a little work...

Um, that's 10 petaflops, not 10 teraflops. I'm assuming a neural network with 10^15 synapses (about 1 or 2 bytes each) with 20 to 100 ms resolution, i.e. 10^16 to 10^17 operations per second. One Tesla = 350 GFLOPS, 1.5 GB, 120W, $1.3K. So maybe $1 billion and 100 MW of power for a few hundred thousand of these plus glue.

-- Matt Mahoney, [EMAIL PROTECTED]
RE: [agi] IBM, Los Alamos scientists claim fastest computer
--- On Thu, 6/12/08, Ed Porter [EMAIL PROTECTED] wrote:

I think processor-to-memory, and inter-processor communications are currently far short

Your concern is over the added cost of implementing a sparsely connected network, which slows memory access and requires more memory for representation (e.g. pointers in addition to a weight matrix). We can alleviate much of the problem by using connection locality.

The brain has about 10^11 neurons with 10^4 synapses per neuron. If we divide this work among 10^6 processors, each representing 1 mm^3 of brain tissue, then each processor must implement 10^5 neurons and 10^9 synapses. By my earlier argument, there can be at most 10^6 external connections assuming 1-2 micron nerve fiber diameter, so half of the connections must be local. This is true at any scale, because when you double the size of a cube, you increase the number of neurons by 8 but increase the number of external connections by 4. Thus, for any size cube, half of the external connections are to neighboring cubes and half are to more distant cubes.

A 1 mm^3 cube can be implemented as a fully connected 10^5 by 10^5 matrix of 10^10 connections. This could be implemented as a 1.25 GB array of bits with 5% of bits set to 1 representing a connection. The internal computation bottleneck is the vector product, which would be implemented using 128-bit AND instructions in SSE2 at full serial memory bandwidth. External communication is at most one bit per connected neuron every cycle (20-100 ms), because the connectivity graph does not change rapidly. A randomly connected sparse network could be described compactly using hash functions.

Also, there are probably more efficient implementations of AGI than modeling the brain, because we are not constrained to use slow neurons.
For example, low-level visual feature detection could be implemented serially by sliding a coefficient window over a 2-D image, rather than by maintaining sets of identical weights for each different region of the image like the brain does. I don't think we really need 10^15 bits to implement the 10^9 bits of long term memory that Landauer says we have.

-- Matt Mahoney, [EMAIL PROTECTED]
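The bit-matrix scheme described above (a packed connection matrix updated with wide AND plus a population count) can be sketched at toy scale. This is my illustration of the technique, not Matt's code; NumPy's 8-bit `bitwise_and` stands in for the 128-bit SSE2 AND, and the sizes, density, and threshold are made up:

```python
import numpy as np

def fire_step(conn_bits, state_bits, threshold):
    """One cycle of a binary neural net stored as a packed bit matrix.

    conn_bits[i] = packed row of incoming connections for neuron i.
    state_bits   = packed vector of which neurons fired last cycle.
    A neuron fires if enough of its connected inputs were active:
    wide AND, then a popcount per row, as in the SSE2 inner loop.
    """
    inputs = np.bitwise_and(conn_bits, state_bits)      # 8 bits per AND
    counts = np.unpackbits(inputs, axis=1).sum(axis=1)  # popcount per row
    return counts >= threshold

rng = np.random.default_rng(1)
n = 1024                                              # toy size; Matt's is 10^5
conn = np.packbits(rng.random((n, n)) < 0.05, axis=1) # 5% connection density
state = np.packbits(rng.random(n) < 0.5)              # half the neurons active

fired = fire_step(conn, state, threshold=30)
print(fired.sum(), "of", n, "neurons fired")
```

At 5% density and 50% activity each neuron sees ~26 active inputs on average, so a threshold of 30 makes a minority fire each cycle. The packed matrix for n = 1024 is 128 KB; at n = 10^5 it is the 1.25 GB figure quoted above.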
Re: [agi] IBM, Los Alamos scientists claim fastest computer
Hmmph. I offer to build anyone who wants one a human-capacity machine for $100K, using currently available stock parts, in one rack. Approx 10 teraflops, using Teslas. (http://www.nvidia.com/object/tesla_c870.html) The software needs a little work...

Josh

On Wednesday 11 June 2008 08:50:58 pm, Matt Mahoney wrote:

http://www.chron.com/disp/story.mpl/business/5826863.html

World's fastest computer at 1 petaflop and 80 TB memory. Cost US $100 million. Claims 1 watt per 376 million calculations, which comes to 2.6 megawatts if my calculations are correct. So with about 10 of these, I think we should be on our way to simulating a human brain sized neural network.

-- Matt Mahoney, [EMAIL PROTECTED]