Josh,

On 4/15/08, J Storrs Hall, PhD <[EMAIL PROTECTED]> wrote:
>
> On Tuesday 15 April 2008 07:36:56 pm, Steve Richfield wrote:
> > As I understand things, speed requires low capacitance, while DRAM
> > requires higher capacitance, depending on how often you intend to
> > refresh. However, refresh operations look a LOT like vector operations,
> > so probably all that would be needed is some logic to watch things and,
> > if the vector operations are NOT adequate for refreshing purposes, to
> > make the sub-processors do some refreshing before continuing. If you
> > work the process for just enough capacitance to support a pretty high
> > refresh rate, then you don't take such a big hit on speed. Anyway, this
> > looked like a third choice, along with going slow with DRAM and fast
> > with SRAM.
>
> Even with our government megabucks we never imagined getting a custom
> process -- at best runs on some slightly out-of-date fab line.
>
> Process capacitance is a tradeoff too -- you can always just make the
> capacitors bigger! But even in fast transistor tech, DRAM is significantly
> slower. Sense amp latency...


My dog in this fight is the architecture, which will be much the same
regardless of the process. Hence, unless/until I see how the process might
force a radical change in architecture, I'll simply leave that issue to
others who are MUCH more skilled at this than I am.
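
To make the refresh-watchdog idea above concrete, here is a rough sketch
in Python of the logic I have in mind. Everything here is illustrative --
the row count, the refresh interval, and refresh_row() standing in for
the hardware operation are all made up:

    # Vector operations double as DRAM refresh; rows the vector traffic
    # misses within the interval get refreshed explicitly by the watchdog.
    REFRESH_INTERVAL = 64_000            # cycles a row may go unrefreshed
    NUM_ROWS = 8192

    last_touched = [0] * NUM_ROWS        # cycle each row was last accessed

    def refresh_row(row):
        pass                             # stand-in for the hardware refresh

    def note_vector_access(row, now):
        # Every vector read/write of a row counts as a refresh.
        last_touched[row] = now

    def watchdog(now):
        # Stall the sub-processors only for rows the vector ops missed.
        for row in range(NUM_ROWS):
            if now - last_touched[row] >= REFRESH_INTERVAL:
                refresh_row(row)
                last_touched[row] = now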

> BTW, if you really want to play with the tech, I believe (I don't keep a
> finger on the latest) that there are chips you can get that are half
> memory and half FPGA that you could use to try your ideas out on. (And
> goddamn it, the FPGAs are denser and faster than full custom was back in
> the 80s when I was doing this!)


The problem with every such chip that I have seen is that I need many
separate parallel banks of memory per ALU. However, the products out there
only offer a single bank, or sometimes two. This might be fun to play
with, but wouldn't be of any practical use that I can see.

> Several "big" items that they put a few of on a cpu chip (besides cache)
> > > that
> > > you can't afford in each processing element: barrel shifters, floating
> > > point
> > > units, even multipliers.
> >
> >
> > I don't plan on using any of these, though I do plan on having just
> enough
> > there to perform the various "step" operations to implement these at
> slow
> > rates.
>
> That works, but kills your speed by a factor of word length. it's a lot
> worse
> for floating point, because remember it's SIMD and you're doing data
> dependent shifts.


Agreed on all points, with the following comments:
1.  By making the sub-processors autonomous within code snippets, there is
no cost to data dependency, and possibly even some advantage, as
processors become temporally separated, thereby reducing contention for
bus usage, etc.
2.  I openly accept a 1-2 order-of-magnitude speed loss on these
operations, set against the 4 orders of magnitude gained from massive
parallelism, for a net gain of 2-3 orders of magnitude in speed even for
the slowest of operations.

> > I am planning on locally synchronous, globally asynchronous operation.
> > Everything within a sub-processor will be pipelined synchronous, while
> > everything connecting to them and connecting them together will be
> > asynchronous.
>
> That's the right hardware choice, but it doesn't fit so well with the
> software architecture of an overall SIMD paradigm.


Done right, the programmer would never see it. Remember, I plan to
implement coordination points, where everything stops until all
sub-processors have reached the coordination point, whereupon everything
continues. The compiler would just drop one of these in wherever needed
to keep things straight.
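
In software terms a coordination point is just a barrier. A minimal
sketch in Python, with threading.Barrier standing in for the hardware
mechanism and the per-processor work invented for illustration:

    import threading

    N_SUBPROCESSORS = 4                  # tiny count, just for the demo
    barrier = threading.Barrier(N_SUBPROCESSORS)

    def sub_processor(pid):
        # Each sub-processor runs its code snippet autonomously and may
        # drift apart in time from its peers...
        result = sum(range(pid * 1000))  # stand-in for real work
        # ...then everything stops here until ALL sub-processors arrive,
        # whereupon everything continues.
        barrier.wait()
        print("sub-processor", pid, "past the coordination point")

    threads = [threading.Thread(target=sub_processor, args=(i,))
               for i in range(N_SUBPROCESSORS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()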

> You'd be better off going with a MIMD network of SIMD machines (a la the
> Sony/IBM Cell chip).


Some might reasonably argue that mine is an MIMD, as the sub-processors are
somewhat autonomous. As I explained, I think this approach brings the best
of both SIMD and MIMD worlds.

Queueing theory says that you are best off with a minimum number of the
fastest possible "servers" (processors) serving a single queue of work. I
think that my 10K proposal produces the fastest processors, and putting
several on a wafer provides enough of them for the most horrendous
possible applications. It appears (to me) that such a wafer, if well
designed, would provide the compute power to start working on AGI in
earnest.
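
The standard M/M/c result backs this up: at equal total capacity, one
fast server gives a lower mean response time than many slow ones. A quick
check in Python (the rates are invented; Erlang C gives the probability a
job has to queue):

    from math import factorial

    def mm_c_response_time(lam, mu, c):
        # Mean response time of an M/M/c queue: arrival rate lam,
        # per-server service rate mu, c servers (requires lam < c*mu).
        a = lam / mu                     # offered load
        rho = a / c                      # per-server utilization
        erlang_c = (a**c / factorial(c)) / (
            (1 - rho) * sum(a**k / factorial(k) for k in range(c))
            + a**c / factorial(c))       # probability of having to queue
        wq = erlang_c / (c * mu - lam)   # mean time spent waiting
        return wq + 1 / mu               # plus the service time itself

    lam = 8.0                            # jobs/sec arriving (made up)
    # Same total capacity two ways:
    print(mm_c_response_time(lam, 10.0, 1))   # one 10x server: ~0.50 sec
    print(mm_c_response_time(lam, 1.0, 10))   # ten 1x servers: ~1.20 sec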

> > I think that I can get most of the 10K speedup for most operations, but
> > there ARE enough 100X operations to really slow it down for some types
> > of programs. Still, a 100X processor is worth SOMETHING?!
>
> Consider Amdahl's (first) Law: if most of your program is parallelizable
> but 1/mth of it is inherently serial, the best speedup you can get is a
> factor of m. Thus if even only 1% is unparallelizable, a speedup of 100
> is the absolute best you can do. But if you've slowed down the central
> processor by a factor of 10 to make things easier for the parallel parts,
> you're only doing 10 times better than an optimized purely serial
> machine.


Still, not bad for a single core.

However, large neural networks are inherently parallel things, so Amdahl's
first law shouldn't be a factor.
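
Josh's arithmetic, for the record, in a few lines of Python (the serial
fraction, processor count, and slowdown are just his illustrative
numbers):

    def amdahl_speedup(serial_fraction, n_processors, serial_slowdown=1.0):
        # Speedup over an optimized purely serial machine when the serial
        # part runs serial_slowdown times slower on the parallel design.
        serial_time = serial_fraction * serial_slowdown
        parallel_time = (1 - serial_fraction) / n_processors
        return 1.0 / (serial_time + parallel_time)

    print(amdahl_speedup(0.01, 10_000))        # ~99x: 1% serial caps it near 100
    print(amdahl_speedup(0.01, 10_000, 10.0))  # ~10x: 10x-slower serial core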


> > > You need a collective function (max, sum, etc.) tree or else you're
> > > doing those operations by Caxton Foster-style bit-serial algorithms
> > > with an inescapable bus turnaround between each bit.
> >
> > Unknown: Is there enough of this to justify the additional hardware?
> > Also, with smart sub-processors they could work together (while jamming
> > up the busses) to form the collective results at ~1% speed after the
> > job has first been cut down by 10:1 by the multiple sub-processors
> > forming the partial results. Hence, the overhead would be high for
> > smaller arrays, but would be lost in the noise for arrays that are
> > >>10K elements.
>
> You need about twice the hardware to do a collective function tree (it's
> a binary tree with the original PEs as its leaves). It's pipelineable, so
> you can run it pretty fast. Algorithmically, it makes a HUGE difference --
> almost ALL the parallel algorithms my Rutgers CAM Project came up with
> depended on it. It's even a poor man's datacom network (acts like a
> segmented bus).
>
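
Point taken. For anyone following along, the tree reduces N values in
about log2(N) combining steps, versus my sub-processors grinding it out
over the busses. A toy sketch of the idea in Python:

    def tree_reduce(values, combine):
        # Collective function tree: combine PE outputs pairwise, level by
        # level, as the hardware tree above the PEs would. Depth is about
        # log2(N) combining steps, and in hardware each level pipelines.
        level = list(values)
        while len(level) > 1:
            nxt = [combine(level[i], level[i + 1])
                   for i in range(0, len(level) - 1, 2)]
            if len(level) % 2:           # odd leftover rides up a level
                nxt.append(level[-1])
            level = nxt
        return level[0]

    pe_outputs = [3, 1, 4, 1, 5, 9, 2, 6]          # made-up PE results
    print(tree_reduce(pe_outputs, max))            # -> 9
    print(tree_reduce(pe_outputs, lambda a, b: a + b))  # -> 31
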
> > > How are you going to store an ordinary matrix? There's no layout
> > > where you can both add and multiply matrices without a raft of data
> > > motion.
> >
> > Making the row length equal to the interleaving ways keeps most of the
> > activity in individual processors. Also, arranging the interleaving so
> > that each processor services small scattered blocks provides a big
> > boost for long and skinny matrices.
>
> You the machine designer don't get to say what shape the user's matrices
> can be (or nobody will use your machine).


However, the compiler can round row lengths up to the next multiple of the
interleaving factor.

> The problem I was pointing out is that for matrix addition, say of A and
> B, the rows of A must be aligned (under the same processing elements)
> with the rows of B, but for multiplication, the rows of A must be aligned
> with the COLUMNS of B.


... which automatically happens when the rows of A just happen to match
the interleaving. Compilers could over-dimension arrays to make this so.
Note the use of "Multiple Tag Mode" on the antique IBM-709/7090 computers,
which required the same sort of over-dimensioning to be useful.
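
Concretely, the over-dimensioning trick looks like this in Python (the
bank count and the simple row-major interleaving are assumptions for
illustration): pad every row out to a multiple of the number of banks,
and column j of every row then lands in the same bank:

    N_BANKS = 16                         # interleaving ways (made up)

    def padded_row_len(cols, n_banks=N_BANKS):
        # Round the row length up to the next multiple of the interleave.
        return ((cols + n_banks - 1) // n_banks) * n_banks

    def bank_of(row, col, row_len):
        # Bank an element lands in under simple row-major interleaving.
        return (row * row_len + col) % N_BANKS

    row_len = padded_row_len(1003)       # 1003 cols -> 1008, multiple of 16
    # Padded: column 5 of EVERY row sits in the same bank.
    print({bank_of(r, 5, row_len) for r in range(64)})   # {5}
    # Unpadded: the same column drifts across all 16 banks.
    print({bank_of(r, 5, 1003) for r in range(64)})      # {0, 1, ..., 15}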

> > My plan was to interconnect the ~10K processors in a 2D fashion with
> > double busses, for a total of 400 busses.
>
> In a 200x200 crossbar?  Not a bad design -- if they're electrically
> segmentable, and you also have a nearest-neighbor torus connection, you
> get something like the ICL/AMT DAP. Nice machine -- my project at Rutgers
> had one. That architecture can do parallel prefix (a key SIMD basic
> algorithm) in 4th-root-of-N time, essentially as good as logarithmic.
>
> Crossbars are expensive.  The DAP was a beige box the size of a PC, and
> cost more than a house. Moore's law killed it -- within 3 years, there
> were stock workstations as fast costing 10 times less.


My design all fits on a single chip - or it will never work.
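
Since parallel prefix keeps coming up, here is the standard log-depth
version in Python for reference (the generic Hillis/Steele-style scan,
not the DAP's 4th-root-of-N variant):

    import operator

    def parallel_prefix(values, combine):
        # Inclusive scan in about log2(N) SIMD steps: at step d, every
        # element combines with the one 2**d positions to its left. On a
        # SIMD machine each step is a single lockstep vector operation.
        x = list(values)
        d = 1
        while d < len(x):
            x = [x[i] if i < d else combine(x[i - d], x[i])
                 for i in range(len(x))]
            d *= 2
        return x

    print(parallel_prefix([3, 1, 4, 1, 5, 9, 2, 6], operator.add))
    # -> [3, 4, 8, 9, 14, 23, 25, 31]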

> > I was attempting to put out a thought-architecture that would gradually
> > become unbeatable through lengthy discussion (since 2006) and refining.
> > This process seems to be working, and I can see that you will
> > definitely make your mark on this process.
>
> I would love to see you succeed. I HATE programming GPUs. We basically
> came to the conclusion that associative processing saved you not so much
> processing time as programming time. Sure, you could build a fancy index
> or hash table and find your key in logarithmic or constant time. But with
> CAM, just throw it in the array and find it whenever you need it. We
> really didn't get much more than a factor of ten out of most practical
> apps, runtime -- but it cut PROGRAM COMPLEXITY by ten as well. And that's
> worth shooting for.
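
That contrast is easy to render in miniature (tags and values invented
for illustration): the hash table needs its structure built up front,
while the CAM just takes whatever you throw in and matches every word
against the key at once -- faked here one word at a time in Python:

    # Hash-table style: build the index up front, O(1) lookups after that.
    directory = {"cat": 1, "dog": 2, "emu": 3}
    print(directory["dog"])              # -> 2

    # CAM style: no structure at all -- just throw (tag, value) pairs in.
    cam = [("cat", 1), ("dog", 2), ("emu", 3), ("dog", 9)]

    def cam_match(key):
        # Real CAM compares EVERY word against the key in one cycle; a
        # sequential scan fakes that broadcast compare.
        return [value for tag, value in cam if tag == key]

    print(cam_match("dog"))              # -> [2, 9], all matches at once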


Observation: I am a front-runner type, looking to find the roads that lead
to where I want to go, in preference to actually packing up the luggage
and dragging it down those roads. You sound like the sort who, once the
thing is roughly sketched out, likes to polish it up and make it as good
as possible. Further, you have a LOT more actual experience doing this
sort of thing with whizzbang chips than I do, and you actually understood
what I was proposing with a minimum of explanation.

Question: Do you have any interest in helping transform my rather rough
concept to a sufficiently detailed road map that anyone with money and an
interest in AGI would absolutely HAVE to fund it? I simply don't see Intel
or anyone else currently running in a direction that will EVER produce an
AGI-capable processor, yet my approach looks like it has a good chance if
only I can smooth out the rough edges and eventually find someone to pay the
bills - things that you obviously have a lot of experience doing. Any
interest?

> > > But I think that given the current push to parallelism of the
> > > multicore style, there will be some new paradigms coming. Who knows.
> >
> > I have been making and defending the statement that when my trial
> > design finally firms up, it will outperform an infinite-sized multicore
> > system. If this isn't already obvious, then please say so, and I will
> > wear my poor fingers out explaining why I believe this.


From here on there are lots of interesting hyperlinks that I absolutely
must read, but I currently don't have the time. I will respond further
when I have had a chance to read them.

Steve Richfield

> Have a look at this article:
> http://arstechnica.com/articles/paedia/cpu/what-you-need-to-know-about-nehalem.ars/3
> Assuming they keep this up, the NUMA multicore of the future will be,
> hardware-wise, a MIMD Connection Machine.  Which means that, simply by not
> using some of its capabilities, it will be equivalent to a SIMD Connection
> Machine. Which means, that by ignoring most of its communications
> capabilities, it will be equivalent to a processing-element in-memory
> associative processor.  Give me infinite cores, and I'll simulate your
> machine 100 times faster than it can run native (because I have hardware
> floating point, etc).
>
> In ten years Moore's Law says Intel has 4K cores on one chip. With cache.
> With
> floating point. MIMD. Fully interconnected bus speed datacomm. That'll be
> hard to compete with.
>
> In the meantime, look at this:
> http://www.clearspeed.com/acceleration/technology/
>
> > A technical reporter who was up on the insides of Intel's 80-core
> > product attended the panel at the last WORLDCOMP, as did two guys from
> > Intel who bailed out when the questions got tough. The general
> > agreement was that for most real-world applications, there is little
> > benefit from having >2 independent cores, regardless of their
> > performance. This will remain so until applications are re-conceived
> > for multi-core operation, but STILL they won't benefit from more than
> > ~10 cores. Since the number of cores is blatantly unstable (Intel keeps
> > promising more and more), to my knowledge, no one at all is working on
> > multi-core implementations of the tough applications.
>
> http://view.eecs.berkeley.edu/wiki/Main_Page
>
> Have a look at these folks.  Or read my AGI-08 paper (on automatic
> programming). I continue to hope/think that by then we'll have moved on to
> programming languages of a high enough level that the programmer won't
> know
> or care which physical model of machine he's using.
>
> Or we'll have an AGI to do all the programming for us :-)
>
> Josh
>
