> On Aug 15, 2015, at 12:00 AM, Steve Richfield <[email protected]> wrote:
>
> On Fri, Aug 14, 2015 at 7:59 PM, J. Andrew Rogers <[email protected]> wrote:
>
> > Scatter/Gather, Map/Reduce, etc style computing models, whether
> > implemented on a GPU or Hadoop, are very poor for many types of parallel
> > computing. It only works if your problem is in the tiny subset that is
> > embarrassingly parallel in a topological sense. Most interesting
> > algorithms are not in this class.
>
> This gets down to what is "interesting". Certainly weather prediction falls
> into this class, as does neurological simulation.
Yes, just about any algorithm that isn’t tractably representable as a DAG.

> Put more simply, for most interesting applications of parallel computation,
> the processors cannot be “independent of each other”.
>
> So, you want absolutely maximum single-thread performance, so you can live
> with as few threads as possible.

No, that is a completely separate matter. I’ve designed codes for complex,
inter-dependent computations that scaled out linearly on thousands of slow
compute elements. While more speed is nice, you usually run out of bandwidth
before you run out of CPU cycles, either memory bandwidth or network
bandwidth, even when you are being hyper-efficient with your bandwidth. CPUs
have very good theoretical IPC these days, which makes it difficult to keep
them fed.

In pure data-parallel algorithms, the resource you are trying to maximally
exploit is CPU cycles; you never want an idle compute element. Before the
last ten years or so, CPU cycles were often the throughput bottleneck.

In pure topologically parallel algorithms, the resource you are trying to
maximally exploit is bandwidth; you never want an idle wire that can carry
bits between compute elements. Designing code that keeps all the wires in
your system efficiently saturated all the time is a pretty foreign concept to
most programmers. Doing it well is an extremely rare skill, even among people
who write software for supercomputers, which is why a lot of supercomputing
codes scale relatively poorly in practice.

> The ability to design efficient parallel data flows between compute
> elements is rather important for many classes of algorithm. Join
> algorithms, for example, which are central to graph traversal.
>
> With FPGAs you can design ANYTHING you can imagine.

Sure, but many things are faster on commodity fixed silicon. I have a friend
who designs FPGA codes for high-performance computing applications and it is
not a good market to be in.
The FPGA market is getting seriously pinched by (1) increasingly rich ISAs in
fixed silicon that adequately replace most of what an FPGA can do for much
cheaper, and (2) the large and growing percentage of codes that are bandwidth
limited.

On the first point, sophisticated vectorized BMI-type ISA extensions give you
most of the building blocks that an FPGA would, and most of the inefficiency
of constructing algorithms this way versus an FPGA is made up for by the fact
that you can run these instructions at 4 GHz. And it is *much* easier and
cheaper to design codes in this domain than to design circuits for an FPGA.
Intel’s AVX-512 in particular adds a lot of functionality for constructing
highly pipelined bit-parallel data flow algorithms, by design.

> This, by the way, is why barrel processing architectures are far more
> interesting than GPU architectures (or FPGAs for that matter) when it comes
> to parallel computations. Barrel processors can efficiently express very
> sophisticated topological parallelism; GPUs are optimized for pure data
> parallelism.
>
> I don't (yet) see this. The only successful application of barrels I know
> of was in peripheral processors attached to some early supercomputers
> designed by Cray. With really cheap but slow processors, they could
> dedicate a processor to every disk or tape drive - which is now no problem
> because they now already have several micros.
>
> Barrels make data chaining impossible, at which point you have thrown away
> an order of magnitude in memory bandwidth.

First, the primary historical example of a barrel processor for general
purpose computing was Cray’s (née Tera’s) MTA architecture, which among other
disadvantages was not implemented on a modern silicon process, making it
difficult to compete with CPUs on performance. That is not a reflection of
the abstract capabilities of this type of computing architecture. There are a
couple of other barrel processor implementations out there, all exotic.
One consequence of that exotic status is that almost no programmers have ever
designed code for a barrel processor.

Second, the few programmers who have worked on barrel processors tend to
design codes for them the same way they would for multithreading on CPUs, in
part because that is how they are marketed. Proper code design for barrel
processors is fundamentally different than for CPUs; you can run CPU-targeted
code on a barrel processor and it will run adequately, but you throw all the
architectural efficiencies out the window. Consequently, the performance does
not seem that special to most programmers who dabble with them.

The only modern barrel processor is Intel's Xeon Phi, particularly the new
version that will be coming out. Again, if you design codes for it as you
would for a CPU, the performance will be unremarkable. And unfortunately,
that is how Intel is positioning it to developers, as a CPU with a lot of
threads, when it really isn’t. So most programmers will design terribly
inefficient algorithms for it.

IMHO, barrel processors are the most efficient modern computing architecture
in the abstract. The problem is that there seem to be only a dozen people in
the world who know how to properly design codes for them. This is ironic
because they are much easier to design for than GPUs and the like, but no one
wants to market them as not being CPU-like, and they fake it well at first
blush. Also, idiomatic C/C++ has a CPU-targeted orthodoxy built into it, and
it is all most programmers know. C++ works fine for barrel processors, but
idiomatic code looks different than it does for CPUs, and no one teaches
idiomatic C/C++ for barrel processors.

> Long ago ~1982 I presented a paper (the closing presentation at the 1st NN
> conference in San Diego) showing that if you chopped a human brain up into
> many processors that communicated over a single bus, that the
> communications rate would be only ~10^9 brief messages/second. Of course
> this was inconceivably fast for that time, but probably doable now.
We are not even close to that today, and probably never will be due to
physics. Any communication system that large is inherently parallel.

-------------------------------------------
AGI
Archives: https://www.listbox.com/member/archive/303/=now
RSS Feed: https://www.listbox.com/member/archive/rss/303/21088071-f452e424
Modify Your Subscription: https://www.listbox.com/member/?member_id=21088071&id_secret=21088071-58d57657
Powered by Listbox: http://www.listbox.com
