Josh,

On 4/15/08, J Storrs Hall, PhD <[EMAIL PROTECTED]> wrote:
>
> Either you're using static RAM (and getting a big hit in density and
> power) or
> DRAM, and getting a big hit in speed.


I have taken some chip design courses but have never actually designed any
chips, so please correct any misconceptions that I may exhibit in the
following discussion.

As I understand things, speed requires low capacitance, whereas DRAM requires
higher capacitance, depending on how often you intend to refresh. However,
refresh operations look a LOT like vector operations, so probably all that
would be needed is some logic to watch things and, if the vector operations
are NOT adequate for refreshing purposes, make the sub-processors do some
refreshing before continuing. If you work the process for just enough
capacitance to support a pretty high refresh rate, then you don't take such
a big hit on speed. Anyway, this looked like a third choice, alongside
going slow with DRAM and fast with SRAM.
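
To make the watch-logic idea concrete, here is a minimal Python sketch. The
row count, refresh interval, and all names are made-up illustrations, not
part of any real design:

REFRESH_INTERVAL = 64_000   # cycles a row may go untouched (assumed figure)
NUM_ROWS = 1024             # rows per sub-processor DRAM bank (assumed)

class RefreshWatcher:
    def __init__(self):
        # Cycle at which each row was last read or written.
        self.last_touched = [0] * NUM_ROWS

    def note_vector_op(self, row, now):
        # Any vector read/write of a row refreshes it as a side effect.
        self.last_touched[row] = now

    def rows_needing_refresh(self, now):
        # Rows that the vector traffic did NOT cover within the interval.
        return [r for r, t in enumerate(self.last_touched)
                if now - t >= REFRESH_INTERVAL]

    def interleave_refresh(self, now):
        # Before continuing, sweep only the stale rows.
        for r in self.rows_needing_refresh(now):
            self.note_vector_op(r, now)  # stands in for an explicit refresh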

YOU CAN"T AFFORD TO USE CACHE outside
> of a line buffer or two. You lose an order of magnitude in speed over what
> can be done on the CPU chip.


My entire design concept is based on NO CACHE.

Several "big" items that they put a few of on a cpu chip (besides cache)
> that
> you can't afford in each processing element: barrel shifters, floating
> point
> units, even multipliers.


I don't plan on using any of these, though I do plan on having just enough
logic there to perform the various "step" operations needed to implement
them at slow rates.
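
For example, a multiply can be built from a classic shift-and-add "step"
that each sub-processor CAN afford. A rough Python sketch of what I mean
(the bit width and names are just illustrative):

def multiply_step(acc, multiplicand, multiplier):
    # One shift-and-add step: conditionally accumulate, then shift.
    if multiplier & 1:
        acc += multiplicand
    return acc, multiplicand << 1, multiplier >> 1

def multiply(a, b, bits=32):
    # Slow path: one step per bit, using only an adder and shifters.
    acc = 0
    for _ in range(bits):
        acc, a, b = multiply_step(acc, a, b)
    return acc

assert multiply(1234, 5678) == 1234 * 5678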

> Instruction broadcast latency and skew. If your architecture is
> synchronous, you're looking at cross-chip times stuck into your
> instruction processing, which means TWO orders of magnitude loss from
> on-chip cpu cycle times.


I am planning on locally synchronous, globally asynchronous operation.
Everything within a sub-processor will be pipelined synchronous, while
everything connecting to them and connecting them together will be
asynchronous.

> So instead of a 10K speedup you get a 100 speedup


I think that I can get most of the 10K speedup for most operations, but
there ARE enough 100X operations to really slow things down for some types
of programs. Still, a 100X processor is worth SOMETHING?!

> > The second mistake is to forget that processor and memory silicon fab
> > use different processes, the former optimized for fast transistors, the
> > latter for dense trench capacitors. You won't get both at once -- you'll
> > give up at least a factor of ten trying to combine them over the
> > radically specialized forms.
>
> You need a collective function (max, sum, etc) tree or else you're doing
> those operations by Caxton Foster-style bit-serial algorithms with an
> inescapable bus turnaround between each bit.


Unknown: Is there enough of this to justify the additional hardware? Also,
with smart sub-processors they could work together (while jamming up the
busses) to form the collective results at ~1% speed, after the job has
first been cut down by 10:1 by the multiple sub-processors forming the
partial results. Hence, the overhead would be high for smaller arrays, but
would be lost in the noise for arrays of >>10K elements.
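
Here is a rough Python model of the two-stage collective I have in mind.
The processor count, the 100:1 bus slowdown, and the cycle model are all
assumed round numbers, not measurements:

def collective_sum(data, num_procs=10_000, bus_slowdown=100):
    chunk = max(1, len(data) // num_procs)

    # Stage 1: each sub-processor reduces its own slice at full speed,
    # cutting the job down before the busses ever get involved.
    partials = [sum(data[i:i + chunk]) for i in range(0, len(data), chunk)]

    # Stage 2: the partials are merged over the shared busses at ~1% speed.
    total = sum(partials)

    # Crude cycle model: stage 1 runs in parallel, stage 2 is serialized.
    cycles = chunk + len(partials) * bus_slowdown
    return total, cycles

total, cycles = collective_sum(list(range(1_000_000)))

The stage-2 term is fixed by the processor count, so as the array grows the
fast parallel stage dominates and the bus overhead fades into the noise.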

> How are you going to store an ordinary matrix? There's no layout where
> you can both add and multiply matrices without a raft of data motion.


Making the row length equal to the number of interleaving ways keeps most
of the activity within individual processors. Also, arranging the
interleaving so that each processor services small scattered blocks
provides a big boost for long and skinny matrices.
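
A minimal Python sketch of the two mappings. The number of ways, the block
size, and the scatter stride are assumed for illustration, not fixed in the
design:

NUM_WAYS = 128   # interleaving ways == matrix row length (assumed)
BLOCK = 8        # side of the small scattered blocks (assumed)

def owner_row_major(i, j):
    # With row length == ways, element (i, j) sits at address
    # i*NUM_WAYS + j, so its bank is simply j: every row maps the same
    # way and column k always lives in processor k.
    return (i * NUM_WAYS + j) % NUM_WAYS

def owner_scattered(i, j):
    # Scattered-block variant: each processor serves small blocks strewn
    # across the matrix. The odd stride 37 is co-prime with NUM_WAYS, so
    # even a long, skinny matrix spreads over all the processors instead
    # of piling onto a few.
    return ((i // BLOCK) * 37 + (j // BLOCK)) % NUM_WAYS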

> Either you build a general parallel communications network, which is
> expensive (think Connection Machine), or your data-shuffling time kills
> you.


My plan was to interconnect the ~10K processors in a 2D fashion with double
busses, for a total of 400 busses.
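
The 400 figure is just grid arithmetic, assuming the ~10K processors form a
100 x 100 grid:

side = 100                # 100 * 100 = 10,000 processors
row_busses = 2 * side     # a double bus along every row
col_busses = 2 * side     # a double bus along every column
assert row_busses + col_busses == 400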

> Again, let me mention graphics boards. They have native floating point,
> wide memory bandwidth, and hundreds of processing units, along with
> fairly decent data comm onboard. Speedups over the cpu can get up to 20
> or so, once the whole program is taken into account -- but for plenty of
> programs, the cpu is faster.


I was attempting to put out a thought-architecture that would gradually
become unbeatable through lengthy discussion (ongoing since 2006) and
refinement. This process seems to be working, and I can see that you will
definitely make your mark on it.

> > > The third mistake is to forget that nobody knows how to program SIMD.
> >
> > I absolutely agree that programmers will quickly fall into two groups -
> > those who "get it" and make the transition to writing vectorizable code
> > fairly easily, and those who go into some other line of work.
>
> Well, it's a high art to write code for GPU's now, but they have APIs
> (e.g. OpenGL) that are a lot more adapted to the mainstream's
> capabilities. I have no doubt that associative processors would be the
> same way.


My thought was that this would eventually become embedded in the language
and in the compiler, which would recognize code that is trying to do
something standard and simply drop in the human-optimized code to best do
the job. In short, I agree with you.
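
As a toy illustration of the kind of recognition I mean, here is a
deliberately naive Python sketch that spots a summation loop by its AST
shape; the matching rule is my own assumption, nothing standard:

import ast

def is_summation_loop(src):
    # Match the idiom: acc = 0; for x in xs: acc += x
    tree = ast.parse(src)
    for node in ast.walk(tree):
        if (isinstance(node, ast.For)
                and len(node.body) == 1
                and isinstance(node.body[0], ast.AugAssign)
                and isinstance(node.body[0].op, ast.Add)):
            return True
    return False

src = """
acc = 0
for x in xs:
    acc += x
"""
if is_summation_loop(src):
    print("substitute the human-optimized vector sum here")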

> But I think that given the current push to parallelism of the multicore
> style, there will be some new paradigms coming. Who knows.


I have been making and defending the statement that, when my trial design
finally firms up, it will outperform an infinitely sized multicore system.
If this isn't already obvious, then please say so, and I will wear my poor
fingers out explaining why I believe this.

> If you don't like math-like symbols, check out NIAL (http://www.nial.com/).
> But actually Matlab (and the scientific packages for languages like
> Python) have data parallel primitives, as well as higher-level functions
> like FFT and principal components analysis...


The problem (that I see) with these is that it is all or nothing. You either
take the canned operation, or write your own from scratch. In the real world
of supercomputer (supermicro?) applications, they often need "little"
enhancements to the standard "canned" operations.
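
A small Python illustration, using NumPy purely as a stand-in for any
canned package (my own choice of example, not anything from your list):

import numpy as np

xs = np.arange(100_000, dtype=np.float64)

canned = xs.sum()   # the standard canned operation: one call, done

# The "little" enhancement: skip elements past a threshold AND saturate
# the running total. Saturation is order-dependent, so no composition of
# canned primitives expresses it -- you are back to a from-scratch loop.
total, cap = 0.0, 1.0e9
for x in xs:
    if x < 50_000.0:
        total = min(total + x, cap)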

> > > Now frankly, a real associative processor (such as described in my
> > > thesis -- read it) would be very useful for AI. You can get close to
> > > faking it nowadays by getting a graphics card and programming it
> > > GPGPU-style. I quit architecture and got back into the meat of AI
> > > because I think that Moore's law has won, and the cycles will be
> > > there before we can write the software, so it's a waste of time to
> > > try end-runs.
> >
> > Not according to Intel, who sees the ~4GHz limit as being a permanent
> > thing. I sat on my ideas for ~20 years, just waiting for this to happen
> > and blow Moore out of the water.
>
> Intel are going parallel too:
> http://www.news.com/2100-1006_3-6119618.html
> Intel pledges 80 cores in five years
> (and it's a real working silicon prototype today)
> (and note that each one of those has a floating-point unit)


However, they haven't even suggested a way for us mere mortals to program
it. In any case, my claim of outperforming an infinite assortment of cores
still stands.

A technical reporter who was up on the insides of Intel's 80-core product
attended the panel at the last WORLDCOMP, as did two guys from Intel, who
bailed out when the questions got tough. The general agreement was that for
most real-world applications there is little benefit to having >2
independent cores, regardless of their performance. This will remain so
until applications are re-conceived for multi-core operation, but even then
they won't benefit from more than ~10 cores. Since the number of cores is
blatantly unstable (Intel keeps promising more and more), to my knowledge
no one at all is working on multi-core implementations of the tough
applications.

Steve Richfield
