On Sun, Aug 16, 2015 at 2:20 PM, J. Andrew Rogers <[email protected]> wrote:
>
> On Aug 15, 2015, at 10:33 AM, Logan Streondj <[email protected]> wrote:
>
>>> In pure topological parallel algorithms, the resource you are trying to
>>> maximally exploit is bandwidth. You never want an idle wire that can
>>> carry bits between compute elements. Designing code that ensures that
>>> all the wires in your system are efficiently saturated all the time is
>>> a pretty foreign concept to most programmers. Doing this well is an
>>> extremely rare skill, even among people that write software for
>>> supercomputers (and hence why a lot of supercomputing codes scale
>>> relatively poorly in practice).
>>
>> The best I've come up with for it is to minimize loaded program size.
>> Map/reduce/expand is good at doing that by breaking arrays up into
>> little bits.
>>
>> What design patterns do you have for these bandwidth limits?
>
> It generally involves three design characteristics: (1) using data
> structures that never require reorganization of your data, i.e. no hash
> tables or range partitioning,

Some examples would be nice, but I guess you mean to prefer simple data
structures, like strings and arrays.

> (2) logically generating uniform “work units” of compute that exceed the
> number of compute elements by at least a couple orders of magnitude, and

So you mean each processing core should have a queue of 100 or more
similar operations. Makes sense, to justify the cost of core set-up.

> (3) have every compute element independently and dynamically schedule its
> own work sequence, using game theory to orchestrate instead of explicit
> coordination.

Okay, I understand a core scheduling its own work from a queue, but I'm
not sure what you mean by using "game theory", which is a rather broad
concept.

> #2 is a somewhat trivial but important design element; #1 and #3 are
> rarely seen in parallel software systems.
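As a rough illustration of points (2) and (3) — my own sketch in Python, not anything from Rogers — the job is split into uniform work units that outnumber the workers by about two orders of magnitude, and each worker dynamically pulls its own next unit instead of being handed a fixed static share. A plain shared `queue.Queue` stands in here for whatever decentralized "game theory" mechanism he has in mind:

```python
# Hypothetical sketch of points (2) and (3) above, not Rogers's actual
# design: split the job into uniform work units that outnumber the
# workers by ~two orders of magnitude, and let every worker dynamically
# schedule its own next unit instead of receiving a fixed static share.
import queue
import threading

def make_units(data, workers, units_per_worker=100):
    """Point (2): ~100 similar work units per compute element."""
    n_units = workers * units_per_worker
    step = max(1, len(data) // n_units)
    return [data[i:i + step] for i in range(0, len(data), step)]

def worker(q, results, lock):
    """Point (3): each worker pulls its own next unit until none remain."""
    while True:
        try:
            unit = q.get_nowait()
        except queue.Empty:
            return
        partial = sum(unit)          # the uniform "unit of compute"
        with lock:
            results.append(partial)

def run(data, workers=4):
    q = queue.Queue()
    for u in make_units(data, workers):
        q.put(u)
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(q, results, lock))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)

print(run(list(range(10_000))))      # same answer as the serial sum
```

Because every worker grabs its next unit the moment it finishes the last one, a fast core simply processes more units than a slow one — no central coordinator assigns anything.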
> Unlike map/reduce-style models, which tend to produce pathological
> amounts of data motion between steps,

For the immutable data-structure types, certainly, since they copy the
whole structure at each step. Though it is possible to use the pattern
with mutable arrays, thus avoiding the copying. Admittedly, expand may
require a new array to be stitched together, since the number of
elements may change.

> this allows you to consistently achieve close to the theoretical minimum
> data motion for the computation at hand *and* ensure that the data
> motion that does occur is approximately uniform over the wires in the
> system.

>> I'm pretty sure ASICs are faster for all static algorithms. The reason
>> FPGAs are nice is for "dynamic" algorithms, or ones which weren't
>> predicted to be needed at time of computer construction. Any algorithm
>> which becomes thoroughly entrenched, or is used often enough in FPGA,
>> would make sense to migrate onto ASICs. Though in terms of evolving
>> algorithms, and adapting to new conditions as an AGI would have to,
>> FPGAs are much better than pure software algorithms.
>
> By “commodity fixed silicon”, I mean CPUs and similar. From a code
> standpoint, these are as dynamic as FPGAs. FPGAs are only efficient for
> implementing transforms on data streams. Think encryption, compression,
> codecs, etc. As common complex bitwise transform building blocks are
> added to CPU instruction sets, turning long idiomatic sequences of
> and/or/shift/mask/etc. into a single-cycle instruction, it is becoming
> increasingly possible to compose transform implementations that are
> faster than an FPGA implementation due to the much higher clock rates of
> CPUs.
>
> Even the reprogrammability of FPGAs has no advantages. Modern software
> increasingly uses dynamic runtime code generation and compilation to
> implement core functionality.
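The mutable-array point earlier in this message — map mutates in place, and only expand forces an allocation — might look like this hypothetical Python sketch (my illustration, not code from the thread):

```python
# Hypothetical sketch of the mutable-array variant discussed above:
# "map" mutates the buffer in place, so no copy is made per step, while
# "expand" must stitch together a new array because the element count
# can change.
from array import array

def map_inplace(buf, fn):
    """Apply fn to every element, reusing the same buffer (no copying)."""
    for i in range(len(buf)):
        buf[i] = fn(buf[i])
    return buf

def expand(buf, fn):
    """fn maps one element to zero or more; here a new buffer is unavoidable."""
    out = array(buf.typecode)
    for x in buf:
        out.extend(fn(x))
    return out

buf = array('i', [1, 2, 3])
map_inplace(buf, lambda x: x * 10)       # buf is now array('i', [10, 20, 30])
doubled = expand(buf, lambda x: (x, x))  # array('i', [10, 10, 20, 20, 30, 30])
```

Reduce can likewise run over the buffer without copying; only expand forces allocation, matching the admission above that expand may need a new array stitched together.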
> I've designed commercial software that is continuously compiling parts
> of itself thousands of times per second in response to user interaction.

The short-term future use-case I was thinking of is for home users, who
for instance might watch a lot of YouTube video and Opus audio, so the
FPGA could have the VP8 and Opus algorithms in hardware to free up the
CPU for other tasks the user might want to do. Long-term is similar:
while some modern CPUs might be super-fast, a sentient robot has to
process a large amount of information from its environment, so it will
need its power-intensive CPU to handle its central consciousness, while
the auxiliary processors like GPUs and FPGAs handle "decoding" audio,
visual, and tactile stimulus into usable information. So for instance
the audio-FPGA decoders could emit phonetic information with tone,
instead of raw audio feeds, drastically reducing the load on the CPU.

>>> IMHO, barrel processors are the most efficient modern computing
>>> architecture in the abstract. The problem is that there seem to be a
>>> dozen people in the world that know how to properly design codes for
>>> them. This is ironic because they are much easier to design for than
>>> GPUs or similar, but no one wants to market them as not being CPU-like
>>> and they fake it well at first blush.
>>
>> So please enlighten us, as a member of the dozen.
>>
>> If not map/reduce/expand, then what design patterns do you use?
>
> That is far beyond the scope of this mailing list and a very long
> discussion. A number of simple design patterns that are (correctly)
> assumed to be expensive and non-scalable on CPUs, and therefore to be
> avoided, are approximately free on barrel processors.

Would you happen to know the names of the aforementioned "simple design
patterns", or perhaps be able to give a description?

> People tend to design code as though it is a CPU where it is just much
> easier to design multithreaded code that works well.
> However, the standard set of data structures and algorithms every
> programmer uses assumes a CPU architecture and is substantially
> suboptimal on a barrel processor. If you want to use them well, you need
> different data structures and algorithms for your basic computing tasks.
>
> Anyone can read the silicon reference documentation. Most programmers
> have little ability to design approximately optimal data structures and
> algorithms for an arbitrary silicon architecture from first principles.
> In the specific case of barrel processors, there is a dearth of
> literature you can look at where people have already figured it out.
>
> This has been the problem with supercomputers generally. No one designs
> algorithms that match the hardware.

Yes, so true. I'm happy you have the opportunity to work on the niche of
hardware architectures you enjoy. I'm hoping that with GI-OS
(general-intelligence operating-system), it will be able to optimize its
limiting algorithms to fit better with the hardware, through various
kinds of narrow-AI algorithms. For instance, this self-programming
(BrainPlus) AI "took advantage of a buffer underflow exception to
terminate the program early":
http://www.primaryobjects.com/CMS/Article163
So it is certainly possible for AIs in the future to take advantage of
hardware features even beyond what is noted in the specs.

-------------------------------------------
AGI Archives: https://www.listbox.com/member/archive/303/=now
RSS Feed: https://www.listbox.com/member/archive/rss/303/21088071-f452e424
Modify Your Subscription: https://www.listbox.com/member/?member_id=21088071&id_secret=21088071-58d57657
Powered by Listbox: http://www.listbox.com
