Josh,

On 4/16/08, J Storrs Hall, PhD <[EMAIL PROTECTED]> wrote:

> On Wednesday 16 April 2008 04:15:40 am, Steve Richfield wrote:
>
> > The problem with every such chip that I have seen is that I need many
> > separate parallel banks of memory per ALU. However, the products out there
> > only offer a single, and sometimes two banks. This might be fun to play
> > with, but wouldn't be of any practical use that I can see.
>
> How much memory are you thinking of, total? The current best is 2 Gbits on a
> chip, and that's pushing the density side of the equation big-time.


That is only because they are making defect-free (or single-defect)
memories, instead of just accepting a tolerable defect RATE and configuring
the defects into the ether with clever memory and processor management. Once
you start looking at quarter wafers, you get another 2-3 orders of
magnitude.
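The defect-rate argument can be sanity-checked with a toy yield model. Under a Poisson defect model, requiring a whole quarter wafer to be defect-free gives essentially zero yield, while sparing out small defective blocks leaves almost all the capacity usable. All numbers below are illustrative assumptions, not real process data:

```python
import math

def perfect_die_yield(defects_per_mm2, die_mm2):
    """Yield when the whole die must be defect-free (no sparing)."""
    return math.exp(-defects_per_mm2 * die_mm2)

def good_block_fraction(defects_per_mm2, block_mm2):
    """Fraction of memory blocks expected defect-free when defective
    blocks can simply be configured out at power-on."""
    return math.exp(-defects_per_mm2 * block_mm2)

d = 0.005  # defects/mm^2 -- an assumed, illustrative density
print(perfect_die_yield(d, 7000))    # whole quarter wafer: essentially zero
print(good_block_fraction(d, 1.0))   # 1 mm^2 spared blocks: ~99.5% usable
```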



> Divide by
> 2 (room for those processors and busses)


My thinking was that the processors would take well under 50% of the area,
maybe 15%. The busses would run on two layers of metallization reserved for
wiring. Note that a 2D bus arrangement is less dense than the "random" wiring
of the memories and processors, and can avoid the densest regions of those
structures, so the busses should be FREE.


> x 10000 x 32


I was thinking 16 bits/word, a convenient word size for log operations.



> and you get 3k words
> per processor. You can't even put a 10k x 10k square matrix on the chip. So
> you're bottlenecked by the off-chip pipe.


My thinking is that EVERYTHING must be done on a SINGLE quarter-wafer "chip"
if this concept is to work reasonably well. Once you start cutting the
things up, all is probably lost.



> > > ... architecture of an overall SIMD paradigm.
> >
> > Done right, the programmer would never see it. Remember, I plan to implement
> > coordination points, where everything stops until all sub-processors are to
> > the coordination point, whereupon everything continues. The compiler would
> > just drop one of these in wherever needed to keep things straight.
>
> Can't argue with that!  Right now, I think there's more upside on the smart
> compiler side of the equation than the hardware, but don't let that stop
> you.


My goal is to define a much better hardware architecture for which at least
one evident approach exists for compiling good real-world code. As I may have
mentioned earlier, I have MUCH better vector compiler credentials than chip
design credentials.



> > Queueing theory says that you are best with a minimum number of the fastest
> > possible "servers" (processors) to serve a queue of work. I think that my
> > 10K proposal produces the fastest processors, and putting several on a wafer
> > provides several of them for the most horrendous possible applications. It
> > appears (to me) that such a wafer, if well designed, would provide the
> > compute power to start working on AGI in earnest.
>
> WSI (wafer-scale integration) has been tried for decades -- we looked at it
> in the 80s.


I think I understand why Amdahl and others failed. I see NO WAY to make this
work without logarithmic arithmetic, but no one even wants to THINK about
logarithmic arithmetic, let alone talk about it. However, logarithmic
arithmetic is PERFECT for neural networks, image recognition, and many other
AI areas. Of course, my approach would also do traditional FP, but at only a
~100x speedup.
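For the curious, here is a minimal sketch of why multiplication is cheap in a log number system, using a hypothetical 16-bit format (1 sign bit plus a fixed-point log2 magnitude with 10 fraction bits; this layout is my assumption for illustration, not the actual proposal's format):

```python
import math

FRAC = 10  # assumed fraction bits in the fixed-point log2 field

def encode(x):
    """Pack a nonzero real as (sign, fixed-point log2|x|)."""
    return (0 if x > 0 else 1, round(math.log2(abs(x)) * (1 << FRAC)))

def decode(sign, logmag):
    """Unpack back to a float."""
    return (-1.0) ** sign * 2.0 ** (logmag / (1 << FRAC))

def lns_mul(a, b):
    """Multiplication collapses to one integer add (plus a sign XOR)."""
    return (a[0] ^ b[0], a[1] + b[1])

x, y = encode(3.0), encode(-5.0)
print(decode(*lns_mul(x, y)))  # close to -15.0
```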



> There are some complex reasons, having to do with defect density
> and the like, that they still, e.g., cut them into chips and then turn around
> and rewire 8 of those chips onto a DIMM.


The problem you mention comes from the presumed absence of power-on testing,
which I see as essential to making real-world WSI work.



> > However, large neural networks are inherently parallel things, so Amdahl's
> > first law shouldn't be a factor.
>
> NNs have two properties you may stumble over. They involve lots of
> multiplication;


... the EASIEST operation with logarithmic arithmetic.



> and they involve lots of datacomm.


... with "lots" being quite manageable by current standards. I know of no NN
application that would outrun the DMA in a PC.
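A back-of-envelope check supports this: if the weights stay on chip, only the activations ever cross the off-chip pipe. All figures below are assumptions for illustration, not benchmarks:

```python
neurons     = 1_000_000   # assumed network size
fan_in      = 1_000       # assumed synapses per neuron
updates_hz  = 100         # assumed full-network update rate
bytes_each  = 2           # 16-bit log-domain activations

on_chip_macs   = neurons * fan_in * updates_hz       # work that stays on chip
off_chip_bytes = neurons * updates_hz * bytes_each   # only activations leave

print(on_chip_macs / 1e9, "GMAC/s on chip")    # 100.0 GMAC/s
print(off_chip_bytes / 1e6, "MB/s off chip")   # 200.0 MB/s
```

The ratio is the point: five orders of magnitude more arithmetic than datacomm, so the off-chip pipe is nowhere near the bottleneck.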

You should go back and re-read my paper, as the following comments all
reflect that you haven't yet realized the ASTRONOMICAL impact of
pipelined logarithmic ALUs. These can easily do a selective
multiply-accumulate every clock cycle with memory-fabrication technology.
The multiply-step discussion ONLY applies to traditional FP for I/O,
compatibility, and rare high-precision computations. I don't care how slow
operations that I don't use are.
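To make the multiply-accumulate claim concrete, here is a toy log-number-system dot product. Multiplication is a single integer add; the accumulate step needs the classic "Gaussian logarithm" lookup sb(d) = log2(1 + 2^(-d)), which is what a hardware LNS adder would hold in ROM. The 8-bit fraction and table size here are my assumptions, not the proposal's parameters:

```python
import math

FRAC = 8
SCALE = 1 << FRAC

def to_lns(x):     # positive operands only, to keep the sketch short
    return round(math.log2(x) * SCALE)

def from_lns(l):
    return 2.0 ** (l / SCALE)

# Gaussian-logarithm table: sb(d) = log2(1 + 2^(-d/SCALE)); ROM in hardware.
SB = [round(math.log2(1.0 + 2.0 ** (-d / SCALE)) * SCALE)
      for d in range(16 * SCALE)]

def lns_add(a, b):
    """Log-domain addition: compare, table lookup, integer add."""
    hi, lo = max(a, b), min(a, b)
    d = hi - lo
    return hi + (SB[d] if d < len(SB) else 0)

def lns_dot(xs, ws):
    """One multiply-accumulate per element, each just a few integer ops --
    the kind of step a pipelined ALU could retire every clock."""
    acc = to_lns(xs[0]) + to_lns(ws[0])
    for x, w in zip(xs[1:], ws[1:]):
        acc = lns_add(acc, to_lns(x) + to_lns(w))
    return from_lns(acc)

print(lns_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # ~32.0 (1*4 + 2*5 + 3*6)
```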
Steve Richfield
==================


> Consider two architectures: a single ALU with a fast multiplier (for 32-bit
> words, 1024 full-adder circuits) versus 32 ALUs that each have a 32-bit adder
> (again for a total of 1024 FAs) and do a mult by a 32-cycle shift-&-add.
> Both architectures can do 32 mults in 32 cycles. But:
> the serial can do 5 mults in 5 cycles, but the parallel still needs 32. The
> serial can do 33 mults in 33 cycles, but the parallel needs 64.
> The amount of hardware isn't really the same. The serial needs one instruction
> decoder and one memory addresser -- the parallel needs 32 of each. So on the
> same real estate you can bulk up the drivers and make the serial faster.
> And finally, the serial suffers no slowdown at all when I interleave a
> shuffle-exchange step (to do an FFT) -- the parallel gets bogged down in
> datacomm.
>
> There's an interesting variant on the parallel version that we worked on
> specifically for matrix mult or neural nets (same basic operation).  The
> overflow of each of the adders fed into the bottom of an adder tree, which
> was one bit wide at the leaves, two bits at the next level up, etc, with a
> full-word accumulator at the top. So we could do fully pipelined dot products
> for as long as we had the data to crunch.
>
> Which was all very cute but went the way of the Connection Machine for much
> the same reason. (but we went faster, heh heh)
>
> > ... which automatically happens when the rows of A just happen to match the
> > interleaving. Compilers could over-dimension arrays to make this so. Note
> > the use of "Multiple Tag Mode" on antique IBM-709/7090 computers, for which
> > you had to do the same to make it useful.
>
> This helps if you're multiplying NxN matrices with only N processors, but does
> you no good if you actually have enough processors to have one element per
> processor!
>
> > My design all fits on a single chip - or it will never work.
>
> See query about memory size above.
>
> > Observation: I am a front-runner type, looking to find the roads that lead
> > to where I want to go. This in preference to actually packing up the luggage
> > and actually dragging it down that road. You sound like the sort that, once
> > the thing is sort of roughed out, likes to polish it up and make it as good
> > as possible. Further, you have a LOT more actual experience doing this sort
> > of thing with whizzbang chips than I do, and you actually understood what I
> > was proposing with a minimum of explanation.
> >
> > Question: Do you have any interest in helping transform my rather rough
> > concept to a sufficiently detailed road map that anyone with money and an
> > interest in AGI would absolutely HAVE to fund it? I simply don't see Intel
> > or anyone else currently running in a direction that will EVER produce an
> > AGI-capable processor, yet my approach looks like it has a good chance if
> > only I can smooth out the rough edges and eventually find someone to pay the
> > bills - things that you obviously have a lot of experience doing. Any
> > interest?
>
> Sorry -- I'm a front-runner type myself. I write books about robot ethics and
> nanotechnology. When I was at Rutgers doing this I had people working for me
> to do the details.
>
> And I'm currently doing AI, which is a lot more interesting than parallel
> architectures, to my mind. In the 80s I felt the biggest stumbling block to
> AI was processing power (and I agree with myself in retrospect). But as
> previously mentioned, the stuff becoming available now is on the verge of
> solving that problem, and we can spend our time figuring out what to do with
> it.
>
> Which reminds me that I've spent a lot more time on this than I should, and
> so I will have to let the whole subject, interesting as it is, go at that and
> get back to work.  Have fun with your project and good luck!
>
> Josh
>
> -------------------------------------------
> agi
> Archives: http://www.listbox.com/member/archive/303/=now
> RSS Feed: http://www.listbox.com/member/archive/rss/303/
> Modify Your Subscription:
> http://www.listbox.com/member/?&;
> Powered by Listbox: http://www.listbox.com
>
