> On Aug 15, 2015, at 12:00 AM, Steve Richfield <[email protected]> 
> wrote:
> 
> On Fri, Aug 14, 2015 at 7:59 PM, J. Andrew Rogers <[email protected] 
> <mailto:[email protected]>> wrote:
> 
> Scatter/Gather, Map/Reduce, etc style computing models, whether implemented 
> on a GPU or Hadoop, are very poor for many types of parallel computing. It 
> only works if your problem is in the tiny subset that is embarrassingly 
> parallel in a topological sense. Most interesting algorithms are not in this 
> class.
> 
> This gets down to what is "interesting". Certainly weather prediction falls 
> into this class, as does neurological simulation. 


Yes, just about any algorithm that isn’t tractably representable as a DAG. 


> Put more simply, for most interesting applications of parallel computation, 
> the processors cannot be “independent of each other”.
> 
> So, you want absolutely maximum single-thread performance, so you can live 
> with as few threads as possible.


No, that is a completely separate matter. I’ve designed codes for complex 
inter-dependent computation that linearly scaled out on thousands of slow 
compute elements. While more speed is nice, you usually run out of bandwidth 
before you run out of CPU cycles, either memory bandwidth or network bandwidth, 
even when you are being hyper-efficient with your bandwidth. CPUs have very 
good theoretical IPC these days, which makes it difficult to keep them fed.


In pure data parallel algorithms, the resource you are trying to maximally 
exploit is CPU cycles. You never want an idle compute element. Before the last 
ten years or so, CPU cycles were often a throughput bottleneck.  
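To make the distinction concrete, here is a minimal sketch (my own illustration, not from the thread) of pure data parallelism: each output element depends only on the matching input element, so any partition of the index space across compute elements works, and the only job is keeping every core busy.

```c
#include <stddef.h>

/* Pure data parallelism: y[i] depends only on x[i] and y[i], so the
 * index space can be split across compute elements arbitrarily and
 * each chunk proceeds with no communication at all. */
void saxpy_chunk(float a, const float *x, float *y,
                 size_t begin, size_t end)
{
    for (size_t i = begin; i < end; i++)
        y[i] = a * x[i] + y[i];
}
```

Two compute elements could each take half the range and never exchange a byte; that is the "tiny subset" of topologically embarrassing problems mentioned above.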

In pure topological parallel algorithms, the resource you are trying to 
maximally exploit is bandwidth. You never want an idle wire that can carry bits 
between compute elements. Designing code that ensures that all the wires in 
your system are efficiently saturated all the time is a pretty foreign concept 
to most programmers. Doing this well is an extremely rare skill, even among 
people who write software for supercomputers (which is why a lot of 
supercomputing codes scale relatively poorly in practice).
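The saturate-the-wires discipline can be sketched with a back-of-envelope cost model (my own illustration; the numbers are assumptions, not measurements): double-buffering lets the transfer of one block hide behind the compute of the previous one, so the slower of the two resources stays busy the whole time instead of the two taking turns idling.

```c
/* Cost model for n blocks, each needing c time units of compute and
 * t time units on the wire.
 *
 * Serial code pays c + t per block: the wire idles while computing
 * and the cores idle while transferring. */
long serial_makespan(long n, long c, long t)
{
    return n * (c + t);
}

/* Double-buffered code overlaps the transfer of block i+1 with the
 * compute of block i.  After the pipeline fills, the slower resource
 * is fully saturated and the faster one hides behind it. */
long overlapped_makespan(long n, long c, long t)
{
    long slow = c > t ? c : t;
    long fast = c > t ? t : c;
    return n * slow + fast;   /* pipeline fill adds one 'fast' stage */
}
```

For 10 blocks at c=3, t=2 the serial version takes 50 units and the overlapped one 32; as n grows, throughput is set entirely by the slower resource, which is why you design for full wires rather than fast cores.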


> The ability to design efficient parallel data flows between compute elements 
> is rather important for many classes of algorithm. Join algorithms, for 
> example, which are central to graph traversal.
> 
> With FPGAs you can design ANYTHING you can imagine. 


Sure, but many things are faster on commodity fixed silicon. I have a friend 
who designs FPGA codes for high-performance computing applications and it is 
not a good market to be in. This market is getting seriously pinched by (1) 
increasingly rich ISAs in fixed silicon that adequately replace most of what an 
FPGA can do at much lower cost and (2) the large and growing percentage of 
codes that are bandwidth limited. 

On the first point, sophisticated vectorized BMI-type ISA extensions give you 
most of the building blocks that an FPGA would give you, and most of the 
inefficiencies of constructing algorithms this way versus an FPGA are made up 
for by the fact that you can run these instructions at 4GHz. And it is *much* 
easier and cheaper to design codes in this domain than to design circuits for 
an FPGA. Intel’s AVX-512 in particular adds a lot of functionality for 
constructing highly pipelined bit-parallel data flow algorithms, by design. 
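As a concrete example of the kind of building block meant here (my own sketch, not from the thread): BMI2's PEXT instruction gathers the mask-selected bits of a word into the low bits of the result in a single instruction. Portable C that implements the same semantics looks like this; on Haswell and later the whole loop collapses to one `_pext_u64`.

```c
#include <stdint.h>

/* Portable model of BMI2 PEXT ("parallel bits extract"): collect the
 * bits of x selected by mask, packed into the low bits of the result.
 * In fixed silicon this is a single instruction, which is exactly the
 * sort of bit-parallel primitive that used to justify an FPGA. */
uint64_t pext_portable(uint64_t x, uint64_t mask)
{
    uint64_t out = 0;
    int k = 0;
    for (int i = 0; i < 64; i++)
        if ((mask >> i) & 1)
            out |= ((x >> i) & 1) << k++;
    return out;
}
```

Primitives like this compose into the pipelined bit-parallel data flows described above, at 4GHz, without ever synthesizing a circuit.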


> This, by the way, is why barrel processing architectures are far more 
> interesting than GPU architectures (or FPGAs for that matter) when it comes 
> to parallel computations. Barrel processors can efficiently express very 
> sophisticated topological parallelism; GPUs are optimized for pure data 
> parallelism.
> 
> I don't (yet) see this. The only successful application of barrels I know of 
> was in peripheral processors attached to some early supercomputers designed 
> by Cray. With really cheap but slow processors, they could dedicate a 
> processor to every disk or tape drive - which is now no problem because they 
> now already have several micros.
> 
> Barrels make data chaining impossible, at which point you have thrown away an 
> order of magnitude in memory bandwidth. 


First, the primary historical example of a barrel processor for general purpose 
computing was Cray’s (née Tera’s) MTA architecture, which among other 
disadvantages was not implemented on a modern silicon process, which made it 
difficult to compete with CPUs on performance. This is not a reflection of the 
abstract capabilities of these types of computing architectures. There are a 
couple of other barrel processor implementations out there, all of them 
exotic, which is why almost no programmers have ever designed code for one.

Second, the few programmers who have worked on barrel processors tend to 
design codes for them the same way as for multithreading on CPUs, in part 
because that is how they are marketed. Proper code design for barrel 
processors is fundamentally different from that for CPUs; you can run 
CPU-targeted code on a barrel processor and it will run adequately, but you 
throw all the architectural efficiencies out the window. Consequently, the 
performance does not seem that special to most programmers who dabble with 
them. 

The only modern barrel processor is Intel's Xeon Phi, particularly the new 
version that will be coming out. Again, if you design codes as you would for a 
CPU, the performance will be unremarkable. And unfortunately, that is how Intel is 
positioning it to developers — like a CPU with a lot of threads — when it 
really isn’t. So most programmers will design terribly inefficient algorithms 
for it.

IMHO, barrel processors are the most efficient modern computing architecture in 
the abstract. The problem is that there seem to be about a dozen people in the 
world who know how to properly design codes for them. This is ironic, because 
they are much easier to design for than GPUs and the like, but no one wants to 
market them as anything other than CPU-like, and at first blush they fake 
being CPUs well.

Also, idiomatic C/C++ etc. has a CPU-targeted orthodoxy built into it, and 
that is all most programmers know. C++ works fine for barrel processors, but 
idiomatic code for them looks different from CPU code, and no one teaches 
idiomatic C/C++ for barrel processors. 
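To illustrate what barrel-idiomatic code looks like (a sketch of my own; the stream count and structure names are arbitrary): instead of chasing one dependent pointer chain and stalling on every cache miss, you keep many independent chains in flight, mirroring the hardware's round-robin rotation through threads so there is always ready work while other streams wait on memory. This single-threaded version only shows the access pattern, not the threading.

```c
#include <stddef.h>

enum { STREAMS = 8 };   /* independent chains kept in flight; arbitrary */

struct node { struct node *next; long val; };

/* CPU-idiomatic code would walk one list at a time, serializing on
 * each ->next load.  Barrel-idiomatic code rotates round-robin over
 * many independent lists, so at any instant several loads are
 * outstanding and a latency-hiding machine always has ready work. */
long sum_interleaved(struct node *heads[STREAMS])
{
    struct node *cur[STREAMS];
    long total = 0;
    int live;

    for (int s = 0; s < STREAMS; s++)
        cur[s] = heads[s];

    do {
        live = 0;
        for (int s = 0; s < STREAMS; s++) {   /* one "barrel" rotation */
            if (cur[s]) {
                total += cur[s]->val;
                cur[s] = cur[s]->next;
                live = 1;
            }
        }
    } while (live);

    return total;
}
```

On a conventional CPU this computes the same sums as the naive version; on a barrel processor (or with one stream per hardware thread) the same structure is what keeps the pipeline from ever going idle.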



> Long ago ~1982 I presented a paper (the closing presentation at the 1st NN 
> conference in San Diego) showing that if you chopped a human brain up into 
> many processors that communicated over a single bus, that the communications 
> rate would be only ~10^9 brief messages/second. Of course this was 
> inconceivably fast for that time, but probably doable now.


We are not even close to that today, and probably never will be due to physics. 
Any communication system that large is inherently parallel. 





-------------------------------------------
AGI
Archives: https://www.listbox.com/member/archive/303/=now