On Tuesday 15 April 2008 07:36:56 pm, Steve Richfield wrote:
> As I understand things, speed requires low capacitance, while DRAM requires
> higher capacitance, depending on how often you intend to refresh. However,
> refresh operations look a LOT like vector operations, so probably all that
> would be needed is some logic to watch things and if the vector operations
> are NOT adequate for refreshing purposes, to make the sub-processors do some
> refreshing before continuing. If you work the process for just enough
> capacitance to support a pretty high refresh rate, then you don't take such
> a big hit on speed.  Anyway, this looked like a third choice, along with
> going slow with DRAM and fast with SRAM.

Even with our government megabucks we never imagined getting a custom 
process -- at best runs on some slightly out-of-date fab line.

Process capacitance is a tradeoff too -- you can always just make the 
capacitors bigger! But even in fast transistor tech, DRAM is significantly 
slower. Sense amp latency...

BTW, if you really want to play with the tech, I believe (though I don't keep 
a finger on the latest) that there are chips you can get that are half memory 
and half FPGA that you could use to try your ideas out on. (And goddamn it, 
the FPGAs are denser and faster than full custom was back in the 80s when I 
was doing this!)
 
> Several "big" items that they put a few of on a cpu chip (besides cache)
> > that
> > you can't afford in each processing element: barrel shifters, floating
> > point
> > units, even multipliers.
> 
> 
> I don't plan on using any of these, though I do plan on having just enough
> there to perform the various "step" operations to implement these at slow
> rates.

That works, but kills your speed by a factor of the word length. It's a lot 
worse for floating point because, remember, it's SIMD and you're doing 
data-dependent shifts. 
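
To make the word-length factor concrete, here's a minimal Python sketch of 
bit-serial addition (the 16-bit width and all names are my assumptions, not 
anyone's actual design): every PE handles one bit-plane per step, so a w-bit 
add takes w sequential steps where a full per-PE adder would take one.

    # Sketch: bit-serial addition across a SIMD array, assuming w-bit words.
    # A single add costs w cycles (vs 1 with a hardware adder per PE) --
    # that's the "factor of word length".

    W = 16  # assumed word width

    def bit_serial_add(a_words, b_words, w=W):
        """Add two vectors one bit-plane at a time, as a SIMD array would."""
        n = len(a_words)
        carry = [0] * n                 # one carry flip-flop per PE
        result = [0] * n
        for bit in range(w):            # w sequential steps -- the slowdown
            a = [(x >> bit) & 1 for x in a_words]   # read one bit-plane
            b = [(x >> bit) & 1 for x in b_words]
            for pe in range(n):         # conceptually parallel across PEs
                s = a[pe] ^ b[pe] ^ carry[pe]
                carry[pe] = (a[pe] & b[pe]) | (carry[pe] & (a[pe] ^ b[pe]))
                result[pe] |= s << bit
        return [r & ((1 << w) - 1) for r in result]

    print(bit_serial_add([3, 100, 7], [5, 28, 9]))  # [8, 128, 16]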

> I am planning on locally synchronous, globally asynchronous operation.
> Everything within a sub-processor will be pipelined synchronous, while
> everything connecting to them and connecting them together will be
> asynchronous.

That's the right hardware choice, but it doesn't fit so well with the software 
architecture of an overall SIMD paradigm. You'd be better off going with a 
MIMD network of SIMD machines (a la the Sony/IBM Cell chip).

> I think that I can get most of the 10K speedup for most operations, but
> there ARE enough 100X operations to really slow it down for some types of
> programs. Still, a 100X processor is worth SOMETHING?!

Consider Amdahl's (first) Law: if most of your program is parallelizable but 
a fraction 1/m is inherently serial, the best speedup you can get is m. 
Thus if even only 1% is unparallelizable, a speedup of 100 is the absolute 
best you can do. But if you've slowed down the central processor by a factor 
of 10 to make things easier for the parallel parts, you're only doing 10 
times better than an optimized purely serial machine.
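
A quick back-of-envelope in Python, just restating the arithmetic above 
(the numbers are the ones from this paragraph):

    # Sketch: Amdahl's law with an assumed 1% serial fraction.

    def amdahl_speedup(serial_fraction, n_processors):
        """Best-case speedup when a fixed fraction of the work is serial."""
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

    # 1% serial: even with unlimited processors the ceiling is 1/0.01 = 100x.
    print(amdahl_speedup(0.01, 10_000))        # ~99.0
    print(amdahl_speedup(0.01, float('inf')))  # 100.0

    # If the scalar side also runs 10x slower than an optimized serial
    # machine, the serial term grows 10x, capping the net win at 10x.
    print(amdahl_speedup(0.01, float('inf')) / 10)  # 10.0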

> > You need a collective function (max, sum, etc) tree or else you're doing
> > those operations by Caxton Foster-style bit-serial algorithms with an
> > inescapable bus turnaround between each bit.
> 
> Unknown: Is there enough of this to justify the additional hardware? Also,
> with smart sub-processors they could work together (while jamming up the
> busses) to form the collective results at ~1% speed after the job has been
> first cut down by 10:1 by the multiple sub-processors forming the partial
> results. Hence, the overhead would be high for smaller arrays, but would be
> lost in the noise for arrays that are >>10K elements.

You need about twice the hardware to do a collective function tree (it's a 
binary tree with the original PEs as its leaves). It's pipelineable, so you 
can run it pretty fast. Algorithmically, it makes a HUGE difference -- almost 
ALL the parallel algorithms my Rutgers CAM Project came up with depended on 
it. It's even a poor man's datacom network (acts like a segmented bus).
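
A toy sketch of the idea in Python (names are mine; the pipelining is only 
noted in comments): the PEs are the leaves, each internal node combines two 
children, so a full reduction takes log2(N) levels.

    # Sketch: a collective-function tree over the PE array.
    # Each level is separate hardware, so successive reductions can be
    # pipelined: a new one enters the tree every cycle.

    import operator

    def tree_reduce(values, combine=operator.add):
        """One pass up the binary tree; len(values) assumed a power of two."""
        level = list(values)
        while len(level) > 1:
            # each sibling pair feeds one parent node, all in parallel
            level = [combine(level[i], level[i + 1])
                     for i in range(0, len(level), 2)]
        return level[0]

    pe_outputs = [7, 2, 9, 4, 1, 8, 3, 6]
    print(tree_reduce(pe_outputs, max))           # 9
    print(tree_reduce(pe_outputs, operator.add))  # 40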

> > How are you going to store an ordinary matrix? There's no layout where
> > you can both add and multiply matrices without a raft of data motion.
> 
> Making the row length equal to the interleaving ways keeps most of the
> activity in individual processors. Also, arranging the interleaving so that
> each processor services small scattered blocks provide a big boost for long
> and skinny matrices.

You the machine designer don't get to say what shape the user's matrices can 
be (or nobody will use your machine). The problem I was pointing out is that 
for matrix addition, say of A and B, the rows of A must be aligned (under the 
same processing elements) with the rows of B, but for multiplication, the 
rows of A must be aligned with the COLUMNS of B.
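
A toy illustration in Python, assuming a one-PE-per-element mapping (my 
assumption, purely to show the access patterns):

    # Sketch: why no single layout serves both operations. For C = A + B,
    # PE(i, j) only needs its own residents A[i][j] and B[i][j]. For
    # C = A x B, PE(i, j) needs ALL of row i of A and ALL of column j of B,
    # so one operand has to move no matter how you lay things out.

    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    n = 2

    # Addition: purely local -- each PE combines its own two residents.
    add = [[A[i][j] + B[i][j] for j in range(n)] for i in range(n)]

    # Multiplication: PE(i, j) must gather n remote elements of each operand.
    mul = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
           for i in range(n)]

    print(add)  # [[6, 8], [10, 12]]
    print(mul)  # [[19, 22], [43, 50]]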
 
> My plan was to interconnect the ~10K processors in a 2D fashion with double
> busses, for a total of 400 busses.

In a 200x200 crossbar?  Not a bad design -- if they're electrically 
segmentable, and you also have a nearest-neighbor torus connection, you get 
something like the ICL/AMT DAP. Nice machine -- my project at Rutgers had 
one. That architecture can do parallel prefix (a key SIMD basic algorithm) in 
4th-root-of-N time, essentially as good as logarithmic. 
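
For concreteness, here's the classic Hillis-Steele form of parallel prefix 
in Python -- log2(N) lockstep steps, each one a single SIMD instruction 
(the mapping onto busses and torus links is glossed over):

    # Sketch: inclusive parallel prefix (scan). Each loop iteration is one
    # SIMD step in which every PE i >= step adds in the value from PE
    # i - step, all in parallel.

    def prefix_sum(values):
        x = list(values)
        step = 1
        while step < len(x):
            x = [x[i] + x[i - step] if i >= step else x[i]
                 for i in range(len(x))]
            step *= 2
        return x

    print(prefix_sum([3, 1, 7, 0, 4, 1, 6, 3]))
    # [3, 4, 11, 11, 15, 16, 22, 25]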

Crossbars are expensive.  The DAP was a beige box the size of a PC, and cost 
more than a house. Moore's law killed it -- within 3 years, there were stock 
workstations just as fast that cost 10 times less.

> I was attempting to put out a thought-architecture that would gradually
> become unbeatable through lengthy discussion (since 2006) and refining.
> This process seems to be working, and I can see that you will definitely
> make your mark on this process.

I would love to see you succeed. I HATE programming GPUs. We basically came to 
the conclusion that associative processing saved you not so much processing 
time as programming time. Sure, you could build a fancy index or hash table 
and find your key in logarithmic or constant time. But with CAM, just throw 
it in the array and find it whenever you need it. We really didn't get much 
more than a factor of ten out of most practical apps at runtime -- but it cut 
PROGRAM COMPLEXITY by ten as well. And that's worth shooting for.
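
A toy contrast of the two programming styles in Python (records and keys 
made up for illustration):

    # Sketch: the associative-programming win. With a CAM, "find" is one
    # parallel compare across the whole array -- no index or hash table to
    # design, build, and keep consistent as the data changes.

    records = [("alice", 3), ("bob", 7), ("carol", 2), ("bob", 9)]

    # Conventional approach: build and maintain a hash index up front.
    index = {}
    for i, (key, _) in enumerate(records):
        index.setdefault(key, []).append(i)
    matches_indexed = [records[i] for i in index.get("bob", [])]

    # CAM-style: every "PE" compares its resident record in parallel;
    # the hardware returns all responders in one operation.
    matches_cam = [rec for rec in records if rec[0] == "bob"]

    print(matches_indexed)  # [('bob', 7), ('bob', 9)]
    print(matches_cam)      # [('bob', 7), ('bob', 9)]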

> > But I think that given the current push to parallelism of the multicore
> > style, there will be some new paradigms coming. Who knows.
> 
> I have been making and defending the statement that when my trial design
> finally firms up, it will outperform an infinite-sized multicore
> system. If this isn't already obvious, then please say so, and I will wear
> my poor fingers out explaining why I believe this.

Have a look at this article: 
http://arstechnica.com/articles/paedia/cpu/what-you-need-to-know-about-nehalem.ars/3
Assuming they keep this up, the NUMA multicore of the future will be, 
hardware-wise, a MIMD Connection Machine.  Which means that, simply by not 
using some of its capabilities, it will be equivalent to a SIMD Connection 
Machine. Which means that, by ignoring most of its communications 
capabilities, it will be equivalent to a processing-element-in-memory 
associative processor.  Give me infinite cores, and I'll simulate your 
machine 100 times faster than it can run native (because I have hardware 
floating point, etc).
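
A minimal Python sketch of that subsumption argument (four "cores" and all 
names assumed): run the SAME kernel on every core over its local slice and 
you have, in effect, a SIMD machine, with each simulated PE-step a native 
hardware floating-point op.

    # Sketch: SIMD by restriction of MIMD -- every core executes one
    # broadcast "instruction" over its own slice, in parallel.

    from multiprocessing import Pool

    def simd_step(local_data, scale=1.5):
        # the one broadcast instruction: a native FP multiply per element
        return [x * scale for x in local_data]

    if __name__ == "__main__":
        slices = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
        with Pool(4) as pool:
            print(pool.map(simd_step, slices))
            # [[1.5, 3.0], [4.5, 6.0], [7.5, 9.0], [10.5, 12.0]]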

In ten years Moore's Law says Intel has 4K cores on one chip. With cache. With 
floating point. MIMD. Fully interconnected bus speed datacomm. That'll be 
hard to compete with.

In the meantime, look at this:
http://www.clearspeed.com/acceleration/technology/

> A technical reporter who was up on the insides of Intel's 80-core
> product attended the panel at the last WORLDCOMP, as did two guys from
> Intel who bailed out when the questions got tough. The general agreement was
> that for most real-world applications, there is little benefit from
> having >2 independent cores, regardless of their performance. This will
> remain so until applications are re-conceived for multi-core operation, but
> STILL they won't benefit from more than ~10 cores. Since the number of cores
> is blatantly unstable (Intel keeps promising more and more), to my
> knowledge, no one at all is working on multi-core implementations of the
> tough applications.

http://view.eecs.berkeley.edu/wiki/Main_Page

Have a look at these folks.  Or read my AGI-08 paper (on automatic 
programming). I continue to hope/think that by then we'll have moved on to 
programming languages of a high enough level that the programmer won't know 
or care which physical model of machine he's using.

Or we'll have an AGI to do all the programming for us :-)

Josh
