Josh,

On 4/15/08, J Storrs Hall, PhD <[EMAIL PROTECTED]> wrote:

> Either you're using static RAM (and getting a big hit in density and
> power) or DRAM, and getting a big hit in speed.
I have taken some chip design courses but have never actually designed any chips, so please correct any misconceptions that I may exhibit in the following discussion.

As I understand things, speed requires low capacitance, whereas DRAM requires higher capacitance, depending on how often you intend to refresh. However, refresh operations look a LOT like vector operations, so probably all that would be needed is some logic to watch things and, if the vector operations are NOT adequate for refreshing purposes, to make the sub-processors do some refreshing before continuing. If you tune the process for just enough capacitance to support a fairly high refresh rate, then you don't take such a big hit on speed. Anyway, this looked like a third choice, alongside going slow with DRAM or fast with SRAM.

> YOU CAN'T AFFORD TO USE CACHE outside of a line buffer or two. You lose an
> order of magnitude in speed over what can be done on the CPU chip.

My entire design concept is based on NO CACHE.

> Several "big" items that they put a few of on a cpu chip (besides cache)
> that you can't afford in each processing element: barrel shifters,
> floating point units, even multipliers.

I don't plan on using any of these, though I do plan on having just enough there to perform the various "step" operations to implement them at slow rates.

> Instruction broadcast latency and skew. If your architecture is
> synchronous you're looking at cross-chip times stuck into your
> instruction processing, which means TWO orders of magnitude loss from
> on-chip cpu cycle times.

I am planning on locally synchronous, globally asynchronous operation. Everything within a sub-processor will be pipelined synchronous, while everything connecting to them and connecting them together will be asynchronous.

> So instead of a 10K speedup you get a 100 speedup

I think that I can keep most of the 10K speedup for most operations, but there ARE enough 100X operations to really slow things down for some types of programs.
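The refresh-rides-on-vector-traffic idea above can be sketched as a toy model. Everything here is made up for illustration (the RETENTION and NUM_ROWS parameters are arbitrary "ticks" and row counts, not real DRAM numbers): any access to a row counts as a refresh, and only rows the vector traffic missed for too long need an explicit refresh cycle.

```python
# Toy model of refresh-by-vector-operation. RETENTION and NUM_ROWS are
# arbitrary illustration values, not real DRAM parameters.

RETENTION = 64   # assumed retention limit, in arbitrary "ticks"
NUM_ROWS = 16    # assumed number of DRAM rows per sub-processor

class RefreshWatchdog:
    def __init__(self):
        self.last_touch = [0] * NUM_ROWS   # tick of last access per row
        self.now = 0

    def vector_op(self, rows):
        """A vector op sweeping some rows refreshes them as a side effect."""
        self.now += 1
        for r in rows:
            self.last_touch[r] = self.now

    def stale_rows(self):
        """Rows the vector traffic did NOT touch in time."""
        return [r for r in range(NUM_ROWS)
                if self.now - self.last_touch[r] >= RETENTION]

    def force_refresh(self):
        """The sub-processor pauses to refresh whatever the traffic missed."""
        for r in self.stale_rows():
            self.last_touch[r] = self.now

w = RefreshWatchdog()
for _ in range(RETENTION):
    w.vector_op(range(0, NUM_ROWS, 2))   # traffic keeps only even rows fresh
print(w.stale_rows())                    # → [1, 3, 5, 7, 9, 11, 13, 15]
```

The point of the sketch is just that the watchdog only fires for rows the normal workload fails to sweep, so a workload with dense vector activity pays almost nothing for refresh.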
Still, a 100X processor is worth SOMETHING?!

> > > The second mistake is to forget that processor and memory silicon fab
> > > use different processes, the former optimized for fast transistors,
> > > the latter for dense trench capacitors. You won't get both at once --
> > > you'll give up at least a factor of ten trying to combine them over
> > > the radically specialized forms.

> You need a collective function (max, sum, etc) tree or else you're doing
> those operations by Caxton Foster-style bit-serial algorithms with an
> inescapable bus turnaround between each bit.

Unknown: Is there enough of this to justify the additional hardware? Also, with smart sub-processors they could work together (while jamming up the busses) to form the collective results at ~1% speed, after the job has first been cut down by 10:1 by the multiple sub-processors forming the partial results. Hence, the overhead would be high for smaller arrays, but would be lost in the noise for arrays of >>10K elements.

> How are you going to store an ordinary matrix? There's no layout where
> you can both add and multiply matrices without a raft of data motion.

Making the row length equal to the number of interleaving ways keeps most of the activity within individual processors. Also, arranging the interleaving so that each processor services small scattered blocks provides a big boost for long and skinny matrices.

> Either you build a general parallel communications network, which is
> expensive (think Connection Machine), or your data-shuffling time kills
> you.

My plan was to interconnect the ~10K processors in a 2D fashion with double busses, for a total of 400 busses.

> Again, let me mention graphics boards. They have native floating point,
> wide memory bandwidth, and hundreds of processing units, along with
> fairly decent data comm onboard. Speedups over the cpu can get up to 20
> or so, once the whole program is taken into account -- but for plenty of
> programs, the cpu is faster.
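The two-stage collective described above (each sub-processor first forms a local partial result, then the partials are combined pairwise over the busses) can be sketched like this. This is illustrative only; the "10 sub-processors" and chunk sizes are arbitrary, not a claim about the actual design:

```python
# Sketch of a two-stage collective (sum, max, etc.):
# stage 1 runs fully parallel inside each sub-processor; stage 2 combines
# the partials pairwise, taking log2(n) bus rounds instead of n.

from functools import reduce

def local_reduce(chunks, op):
    """Stage 1: each sub-processor folds its own block of elements."""
    return [reduce(op, chunk) for chunk in chunks]

def tree_combine(partials, op):
    """Stage 2: pairwise tree combination of the partial results."""
    while len(partials) > 1:
        partials = [op(partials[i], partials[i + 1])
                    if i + 1 < len(partials) else partials[i]
                    for i in range(0, len(partials), 2)]
    return partials[0]

data = list(range(100))
chunks = [data[i:i + 10] for i in range(0, 100, 10)]  # 10 "sub-processors"
total = tree_combine(local_reduce(chunks, lambda a, b: a + b),
                     lambda a, b: a + b)
print(total)   # → 4950, i.e. sum(range(100))
```

Stage 1 is where the "first cut down by 10:1" happens for free; only stage 2 touches the shared busses, which is why the overhead washes out for very large arrays.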
I was attempting to put out a thought-architecture that would gradually become unbeatable through lengthy discussion (since 2006) and refinement. This process seems to be working, and I can see that you will definitely make your mark on it.

> > > The third mistake is to forget that nobody knows how to program SIMD.
> >
> > I absolutely agree that programmers will quickly fall into two groups -
> > those who "get it" and make the transition to writing vectorizable code
> > fairly easily, and those who go into some other line of work.
>
> Well, it's a high art to write code for GPU's now but they have APIs
> (e.g. OpenGL) that are a lot more adapted to the mainstream's
> capabilities. I have no doubt that associative processors would be the
> same way.

My thought was that this would eventually become embedded in the language, and in the compiler, which would recognize code that is trying to do something standard and simply drop in the human-optimized code that best does the job. In short, I agree with you.

> But I think that given the current push to parallelism of the multicore
> style, there will be some new paradigms coming. Who knows.

I have been making and defending the statement that when my trial design finally firms up, it will outperform an infinite-sized multicore system. If this isn't already obvious, then please say so, and I will wear my poor fingers out explaining why I believe this.

> If you don't like math-like symbols, check out NIAL (http://www.nial.com/).
> But actually Matlab (and the scientific packages for languages like
> Python) have data parallel primitives, as well as higher-level functions
> like FFT and principal components analysis...

The problem (that I see) with these is that it is all or nothing. You either take the canned operation, or write your own from scratch. In the real world of supercomputer (supermicro?) applications, they often need "little" enhancements to the standard "canned" operations.
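The "compiler recognizes a standard idiom and drops in the tuned routine" idea above can be sketched with a deliberately tiny pattern matcher. This is purely hypothetical illustration, not any real compiler's machinery: it spots the canonical accumulation loop in a fragment's AST, which is exactly the kind of code a vectorizing compiler could replace with a hand-optimized reduction.

```python
# Hypothetical illustration: recognize `for x in seq: acc = acc + x`
# in a tiny AST, the idiom a compiler could swap for a tuned reduction.

import ast

SOURCE = """
total = 0
for x in values:
    total = total + x
"""

def is_accumulation_loop(node):
    """True if `node` is a one-statement for-loop whose body is `acc = acc + x`."""
    if not isinstance(node, ast.For) or len(node.body) != 1:
        return False
    stmt = node.body[0]
    return (isinstance(stmt, ast.Assign)
            and isinstance(stmt.value, ast.BinOp)
            and isinstance(stmt.value.op, ast.Add))

tree = ast.parse(SOURCE)
found = any(is_accumulation_loop(n) for n in ast.walk(tree))
print(found)   # → True: a compiler could substitute sum(values) here
```

A real recognizer would of course check that the accumulator and loop variable line up and that nothing else touches them, but the all-or-nothing complaint above is precisely that today's canned primitives give you no such "recognize, then customize" middle ground.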
> > > Now frankly, a real associative processor (such as described in my
> > > thesis -- read it) would be very useful for AI. You can get close to
> > > faking it nowadays by getting a graphics card and programming it
> > > GPGPU-style. I quit architecture and got back into the meat of AI
> > > because I think that Moore's law has won, and the cycles will be there
> > > before we can write the software, so it's a waste of time to try
> > > end-runs.
> >
> > Not according to Intel, who sees the ~4GHz limit as being a permanent
> > thing. I sat on my ideas for ~20 years, just waiting for this to happen
> > and blow Moore out of the water.
>
> Intel are going parallel too:
> http://www.news.com/2100-1006_3-6119618.html
> Intel pledges 80 cores in five years
> (and it's a real working silicon prototype today)
> (and note that each one of those has a floating-point unit)

However, they haven't even suggested a way for us mere mortals to program it. In any case, my claim of outperforming an infinite assortment of cores still stands.

A technical reporter who was up on the insides of Intel's 80-core product attended the panel at the last WORLDCOMP, as did two guys from Intel who bailed out when the questions got tough. The general agreement was that for most real-world applications there is little benefit from having more than 2 independent cores, regardless of their performance. This will remain so until applications are re-conceived for multi-core operation, but even then they won't benefit from more than ~10 cores. Since the number of cores is blatantly unstable (Intel keeps promising more and more), to my knowledge no one at all is working on multi-core implementations of the tough applications.
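The "little benefit beyond ~10 cores" observation is usually formalized as Amdahl's law (not named in this thread, but it is the standard arithmetic): if only a fraction p of a program parallelizes, n cores give a speedup of 1 / ((1 - p) + p / n), which saturates at 1 / (1 - p) no matter how many cores you add. The p = 0.9 below is an illustrative value, not a measurement.

```python
# Amdahl's law: speedup from n cores when fraction p of the work
# parallelizes. The serial fraction (1 - p) caps the speedup at 1/(1 - p).

def amdahl(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for n in (2, 10, 80, 10**6):
    print(n, round(amdahl(0.9, n), 2))
# With p = 0.9 the speedup saturates near 10x regardless of core count:
# 2 → 1.82, 10 → 5.26, 80 → 8.99, 1000000 → 10.0
```

So an 80-core chip running a program that is 90% parallelizable beats 10 cores by well under a factor of two, which is the arithmetic behind the panel's pessimism.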
Steve Richfield

-------------------------------------------
agi
Archives: http://www.listbox.com/member/archive/303/=now