On Sat, Aug 15, 2015 at 12:54 PM, J. Andrew Rogers <[email protected]>
wrote:

>
>
> On Aug 15, 2015, at 12:00 AM, Steve Richfield <[email protected]>
> wrote:
>
> On Fri, Aug 14, 2015 at 7:59 PM, J. Andrew Rogers <[email protected]>
> wrote:
>
>>
>> Scatter/Gather, Map/Reduce, etc style computing models, whether
>> implemented on a GPU or Hadoop, are very poor for many types of parallel
>> computing. It only works if your problem is in the tiny subset that is
>> embarrassingly parallel in a topological sense. Most interesting algorithms
>> are not in this class.
>>
>
> This gets down to what is "interesting". Certainly weather prediction
> falls into this class, as does neurological simulation.
>
>
>
> Yes, just about any algorithm that isn’t tractably representable as a DAG.
>
>
> Put more simply, for most interesting applications of parallel
>> computation, the processors cannot be “independent of each other”.
>>
>
> So, you want absolutely maximum single-thread performance, so you can live
> with as few threads as possible.
>
>
>
> No, that is a completely separate matter. I’ve designed codes for complex
> inter-dependent computation that linearly scaled out on thousands of slow
> compute elements. While more speed is nice, you usually run out of
> bandwidth before you run out of CPU cycles, either memory bandwidth or
> network bandwidth, even when you are being hyper-efficient with your
> bandwidth. CPUs have very good theoretical IPC these days, which makes it
> difficult to keep them fed.
>
>
> In pure data parallel algorithms, the resource you are trying to maximally
> exploit is CPU cycles. You never want an idle compute element. Before the
> last ten years or so, CPU cycles were often a throughput bottleneck.
>
> In pure topological parallel algorithms, the resource you are trying to
> maximally exploit is bandwidth. You never want an idle wire that can carry
> bits between compute elements. Designing code that ensures that all the
> wires in your system are efficiently saturated all the time is a pretty
> foreign concept to most programmers. Doing this well is an extremely rare
> skill, even among people that write software for supercomputers (and hence
> why a lot of supercomputing codes scale relatively poorly in practice).
>
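[The compute-bound vs. bandwidth-bound distinction above can be made concrete with a back-of-envelope roofline estimate. All hardware numbers below are illustrative assumptions, not measurements of any particular machine:]

```python
# Back-of-envelope roofline estimate: is a kernel compute-bound or
# bandwidth-bound?  Hardware numbers are assumed for illustration.

def roofline_gflops(peak_gflops, mem_bw_gbps, flops_per_byte):
    """Attainable GFLOP/s = min(peak compute, bandwidth * intensity)."""
    return min(peak_gflops, mem_bw_gbps * flops_per_byte)

# Assumed machine: 500 GFLOP/s peak, 50 GB/s memory bandwidth.
PEAK, BW = 500.0, 50.0

# A streaming kernel like y[i] += a * x[i] does 2 flops per 16 bytes
# moved (intensity = 0.125 flops/byte): bandwidth-bound.
stream = roofline_gflops(PEAK, BW, 2 / 16)

# A blocked matrix multiply can reach ~10 flops/byte: compute-bound.
gemm = roofline_gflops(PEAK, BW, 10.0)

print(stream)  # 6.25  -- a tiny fraction of peak; the wire is the limit
print(gemm)    # 500.0 -- hits the compute roof
```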
>
The best I've come up with for the bandwidth problem is to minimize loaded
program size. Map/reduce/expand is good at doing that by breaking arrays up
into little bits.
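
[The chunking idea can be sketched like this; the chunk size and inputs are arbitrary examples:]

```python
# A minimal sketch of map/reduce over an array broken into small chunks,
# in the spirit of "breaking arrays up into little bits".  Each chunk is
# mapped and reduced independently, so only chunk-sized pieces need to be
# resident at once.

from functools import reduce

def chunked_map_reduce(data, mapper, reducer, chunk_size=4):
    partials = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        partials.append(reduce(reducer, map(mapper, chunk)))
    return reduce(reducer, partials)

# Sum of squares of 0..9, computed chunk by chunk.
total = chunked_map_reduce(range(10), lambda x: x * x, lambda a, b: a + b)
print(total)  # 285
```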

What design patterns do you have for these bandwidth limits?




>
> The ability to design efficient parallel data flows between compute
>> elements is rather important for many classes of algorithm. Join
>> algorithms, for example, which are central to graph traversal.
>>
>
> With FPGAs you can design ANYTHING you can imagine.
>
>
>
> Sure, but many things are faster on commodity fixed silicon.
>
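
[The join algorithms mentioned above can be illustrated with a minimal hash join -- the standard build/probe data flow. This is a sketch with made-up inputs, not anyone's production code:]

```python
# A minimal hash join: build a hash table on one input, then stream the
# other input past it, emitting matching pairs.

def hash_join(left, right, key_left, key_right):
    # Build phase: index the left relation by join key.
    table = {}
    for row in left:
        table.setdefault(key_left(row), []).append(row)
    # Probe phase: stream the right relation, emitting matches.
    out = []
    for row in right:
        for match in table.get(key_right(row), []):
            out.append((match, row))
    return out

edges = [(1, 2), (2, 3), (3, 4)]          # graph edges (src, dst)
# Joining edges.dst to edges.src yields all length-2 paths -- the
# graph-traversal use of joins mentioned above.
paths = hash_join(edges, edges, key_left=lambda e: e[1],
                  key_right=lambda e: e[0])
print(paths)  # [((1, 2), (2, 3)), ((2, 3), (3, 4))]
```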

I'm pretty sure ASICs are faster for all static algorithms.
FPGAs are nice for "dynamic" algorithms, or ones that weren't predicted to
be needed at the time of computer construction. Any algorithm that becomes
thoroughly entrenched, or is used often enough on an FPGA, would make sense
to migrate onto an ASIC. Though in terms of evolving algorithms, and
adapting to new conditions as an AGI would have to, FPGAs are much better
than pure software algorithms.



> I have a friend who designs FPGA codes for high-performance computing
> applications and it is not a good market to be in. This market is getting
> seriously pinched by (1) increasingly rich ISAs in fixed silicon that
> adequately replace most of what an FPGA can do for much cheaper and (2) the
> large and growing percentage of codes that are bandwidth limited.
>
> On the first point, sophisticated vectorized BMI-type ISA extensions give
> you most of the building blocks that an FPGA would give you, and most of
> the inefficiencies of constructing algorithms this way versus an FPGA is
> made up for by the fact that you can run these instructions at 4GHz. And it
> is *much* easier and cheaper to design codes in this domain than to design
> circuits for an FPGA. Intel’s AVX-512 in particular adds a lot of
> functionality for constructing highly pipelined bit-parallel data flow
> algorithms, by design.
>
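[The BMI-style building blocks referred to above can be emulated in software. This sketches PEXT ("parallel bit extract"): gather the bits of x selected by a mask into the low-order positions. On real BMI2 hardware this is a single instruction; the loop here is only illustrative:]

```python
# Software emulation of the BMI2 PEXT instruction.

def soft_pext(x, mask):
    out, bit = 0, 0
    while mask:
        low = mask & -mask          # lowest set bit of the mask
        if x & low:
            out |= 1 << bit
        bit += 1
        mask ^= low                 # clear that mask bit
    return out

# Extract bits 1 and 3 of 0b1011: both set, so the result is 0b11.
print(bin(soft_pext(0b1011, 0b1010)))  # 0b11
```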
> IMHO, barrel processors are the most efficient modern computing
> architecture in the abstract. The problem is that there seem to be a dozen
> people in the world that know how to properly design codes for them. This
> is ironic because they are much easier to design for than GPUs or similar
> but no one wants to market them as not being CPU-like and they fake it well
> at first blush.
>
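
[A toy model of the barrel-processor idea, for the record: N independent contexts are advanced round-robin, one step each, so long-latency steps in one context are hidden by switching to the others. Generators stand in for hardware thread contexts; this illustrates only the scheduling discipline, not any real machine:]

```python
# Round-robin "barrel" scheduling over independent contexts.

def context(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"        # one "instruction" per timeslot

def barrel(contexts):
    trace = []
    while contexts:
        live = []
        for ctx in contexts:       # strict round-robin over live contexts
            try:
                trace.append(next(ctx))
                live.append(ctx)
            except StopIteration:
                pass               # context retired
        contexts = live
    return trace

trace = barrel([context("A", 2), context("B", 2), context("C", 1)])
print(trace)  # ['A:0', 'B:0', 'C:0', 'A:1', 'B:1']
```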

So please enlighten us, as a member of the dozen.

If not map/reduce/expand then what design patterns do you use?



>
> Also, idiomatic C/C++ etc has a CPU-targeted orthodoxy built into it and
> is all programmers know. C++ works fine for barrel processors but idiomatic
> code looks different than for CPUs and no one teaches idiomatic C/C++ for
> barrel processors.
>
>
>
> Long ago ~1982 I presented a paper (the closing presentation at the 1st NN
> conference in San Diego) showing that if you chopped a human brain up into
> many processors that communicated over a single bus, that the
> communications rate would be only ~10^9 brief messages/second. Of course
> this was inconceivably fast for that time, but probably doable now.
>
>
>
> We are not even close to that today, and probably never will be due to
> physics. Any communication system that large is inherently parallel.
>
>



-------------------------------------------
AGI
Archives: https://www.listbox.com/member/archive/303/=now
RSS Feed: https://www.listbox.com/member/archive/rss/303/21088071-f452e424
Modify Your Subscription: 
https://www.listbox.com/member/?member_id=21088071&id_secret=21088071-58d57657
Powered by Listbox: http://www.listbox.com
