On Mon, Jan 23, 2012 at 11:35 AM, Lux, Jim (337C) <[email protected]> wrote:
> The "processors in a sea of memory" model has been around for a while
> (and, in fact, there were a lot of designs in the 80s, at the board if not
> the chip level: transputers, early hypercubes, etc.) So this is
> revisiting the architecture at a smaller level of integration.

I remember 12-15 years ago I was reading quite a few papers published by
the Berkeley Intelligent RAM (IRAM) Project:

http://iram.cs.berkeley.edu/

So 15 years later someone suddenly thinks that it is a good idea to ship
IRAM systems to real customers?? :-D

Rayson

=================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/

> One thing about power consumption.. Those memory cells consume so little
> power because most of them are not being accessed. They're essentially
> "floating" capacitors. So the power consumption of the same transistor in
> a CPU (where the duty factor is 100%) is going to be higher than the power
> consumption in a memory cell (where the duty factor is 0.001% or
> something).
>
> And, as always, the challenge is in the software to effectively use the
> distributed computing architecture. When you think about it, we've had
> almost a century to figure out how to program single instruction stream
> computers of one sort or another, and it was easy, because we are single
> stream (SISD) ourselves. We can create a simulation of multiple threads
> by timesharing in some sense (in either the human or machine models).
>
> And we have lots of experience with EP type, or even scatter/gather type
> processes (tilling land, building pyramids, assembly lines) so that model
> of software/hardware architecture can be argued to be a natural outgrowth
> of what humans already do, and have been figuring out how to do for
> millennia. (did Imhotep use some form of project planning tools? You bet
> he did)
>
> However, true parallelism (MIMD) is harder to conceptualize. Vector and
> matrix math is one area, but I'd argue that it's just the same as EP
> tasks, just at a finer grain.
> Systolic arrays, vector pipelines, FFT boxes
> from FloatingPointSystems, are all basically ways to use the underlying
> structure of the task, in an easy way (how long til there's a hardware
> implementation of the new faster-than-FFT algorithm published last week?)
> And in all those cases, you have to explicitly make use of the special
> capabilities. That is, in general, the compiler doesn't recognize it
> (although, modern parallelizing compilers ARE really smart.. So they
> probably do find most of the cases)
>
> I don't know that we have good conceptual tools to take a complex task and
> break it effectively into multiple disparate component tasks that can
> effectively run in parallel. It's a hard task for something
> straightforward (e.g. Designing a big system or building a spacecraft),
> and I don't know that any of the outputs of current project planning
> techniques (which are entirely manual) can be said to produce
> "generalized" optimum outputs. They produce *an* output for dividing the
> complex task up (or else the project can't be done), but I don't know that
> the output is provably optimum or even workable (an awful lot of projects
> over-run, and not just because of bad estimates for time/cost).
>
> So the problem facing would-be users of new computing architectures (be
> they TOMI, HyperCube, ConnectionMachine, or Beowulf) is like that facing a
> project planner given a big project, and a brand new crew of workers who
> speak a different language, with skill sets totally different than the
> planner is used to.
>
> This is what the computer user is facing: There's no compiler or problem
> description technique that will automatically generate a "work plan" to
> use that new architecture. It's all manual, and it's hard, and you're up
> against a brute force "why not just hook 500 people up to that rock and
> drag it" approach.
> The people who figure out the new way will certainly
> benefit society, but there's going to be a lot of false starts along the
> way. And, I'm not particularly sanguine about the process being automated
> (at least in the sense of automatic parallelizing compilers that recognize
> loops and repetitive stuff). I think that for the next few years
> (decades?) using new architectures is going to rely on skilled humans to
> figure out how to use it, on an ad hoc, unique to each application, basis.
>
>
> [Back in the 80s, I had a loaner "sugarcube" 4 node Intel hypercube
> sitting on my desk for a while. I wanted to figure out something to do
> with it that is non-trivial, and not the examples given in the docs (which
> focused on stuff like LISP and Prolog). I started, as I'm sure many
> people do, by taking a multithreaded application I had, and distributing
> the threads to processors. You pretty quickly realize, though, that it's
> tough to evenly distribute the loads among processors, and you wind up
> with processor 1 waiting for something that processor 2 is doing, which in
> turn is waiting for something that processor 3 is doing, and so forth. In
> a "shared processor" this isn't a big deal, and is transparent: the
> processor is always working, and aside from deadlocks, there's no
> particular reason why you need to balance load among threads.
>
> For what it's worth, the task I was doing was comparable to taking
> execution of a Matlab/simulink model and distributing it across multiple
> processors. You had signals flowing among blocks, etc. These things are
> computationally intensive (especially if you have loops in the design, so
> you need an iterative solution of some sort) so the idea of putting
> multiple processors to work is attractive. But the "work" in each block
> in the diagram isn't known a-priori and might vary during the course of
> the simulation, so it's not like you can come up with some sort of
> automatic partitioning algorithm.]
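The load-balancing point above can be made concrete with a toy model. This is only a sketch under stated assumptions: each processor gets a fixed amount of work per simulation step, every processor must wait for the slowest one before the next step starts, and communication cost is ignored. The function name and the load figures are made up for illustration and are not from the original post.

```python
def step_time_and_utilization(loads):
    """Return (wall-clock time for one step, fraction of processor time used).

    loads: work assigned to each processor for one step, in arbitrary
    time units. Assumes all processors wait for the slowest one.
    """
    step_time = max(loads)                      # the slowest processor sets the pace
    busy_time = sum(loads)                      # useful work actually done
    utilization = busy_time / (len(loads) * step_time)
    return step_time, utilization


# Hypothetical 4-node cases with the same total work (100 units).
balanced = [25, 25, 25, 25]
skewed = [70, 10, 10, 10]

for name, loads in (("balanced", balanced), ("skewed", skewed)):
    t, u = step_time_and_utilization(loads)
    print(f"{name}: step time {t}, utilization {u:.0%}, "
          f"single shared processor would take {sum(loads)}")
```

With the skewed split, four processors finish a step in 70 units instead of 25, so the speedup over one shared processor is about 1.4x rather than 4x. On the shared processor the split does not matter at all, which is the asymmetry described above.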
>
>
> On 1/23/12 7:38 AM, "Prentice Bisbal" <[email protected]> wrote:
>
>>If you read this PDF from Venray Technologies, which is linked to in the
>>article, you see where the "Whole Bunch of Crazy" part comes from. After
>>reading it, Venray lost a lot of credibility in my book.
>>
>>https://www.venraytechnology.com/economics_of_cpu_in_DRAM2.pdf
>>
>>--
>>Prentice
>>
>>
>>On 01/23/2012 08:45 AM, Eugen Leitl wrote:
>>> (Old idea, makes sense, will they be able to pull it off?)
>>>
>>> http://hothardware.com/News/CPU-Startup-Combines-CPUDRAMAnd-A-Whole-Bunch-Of-Crazy/
>>>
>>> CPU Startup Combines CPU+DRAM--And A Whole Bunch Of Crazy
>>>
>>> Sunday, January 22, 2012 - by Joel Hruska
>>>
>>> The CPU design firm Venray Technology announced a new product design this
>>> week that it claims can deliver enormous performance benefits by combining
>>> CPU and DRAM on to a single piece of silicon. We spent some time earlier
>>> this fall discussing the new TOMI (Thread Optimized Multiprocessor) with
>>> company CTO Russell Fish, but while the idea is interesting, its
>>> presentation is marred by crazy conceptualizing and deeply suspect
>>> analytics.
>>>
>>> The Multicore Problem:
>>>
>>> There are three limiting factors, or walls, that limit the scaling of
>>> modern microprocessors. First, there's the memory wall, defined as the gap
>>> between the CPU and DRAM clock speed. Second, there's the ILP (Instruction
>>> Level Parallelism) wall, which refers to the difficulty of decoding enough
>>> instructions per clock cycle to keep a core completely busy. Finally,
>>> there's the power wall--the faster a CPU is and the more cores it has, the
>>> more power it consumes.
>>>
>>> Attempting to compensate for one wall often risks running afoul of the
>>> other two. Adding more cache to decrease the impact of the CPU/DRAM speed
>>> discrepancy adds die complexity and draws more power, as does raising CPU
>>> clock speed.
>>> Combined, the three walls are a set of fundamental
>>> constraints--improving architectural efficiency and moving to a smaller
>>> process technology may make the room a bit bigger, but they don't remove
>>> the walls themselves.
>>>
>>> TOMI attempts to redefine the problem by building a very different type of
>>> microprocessor. The TOMI Borealis is built using the same transistor
>>> structures as conventional DRAM; the chip trades clock speed and
>>> performance for ultra-low leakage. Its design is, by necessity, extremely
>>> simple. Not counting the cache, TOMI is a 22,000 transistor design, as
>>> compared to 30,000 transistors for the original ARM2. The company's early
>>> prototypes, built on legacy DRAM technology, ran at 500MHz on a 110nm
>>> process.
>>>
>>> Instead of surrounding a CPU core with a substantial amount of L2 and L3
>>> cache, Venray inserted a CPU core directly into a DRAM design. A TOMI
>>> Borealis IC connects eight TOMI cores to a 1Gbit DRAM with a total of 16
>>> ICs per 2GB DIMM. This works out to a total of 128 processor cores per
>>> DIMM. Because they're built using ultra-low-leakage processes and are so
>>> small, such cores cost very little to build and consume vanishingly small
>>> amounts of power (Venray claims power consumption is as low as 23mW per
>>> core at 500MHz).
>>>
>>> It's an interesting idea.
>>>
>>> The Bad:
>>>
>>> When your CPU has fewer transistors than an architecture that debuted in
>>> 1986, there's a good chance that you left a few things out--like an FPU,
>>> branch prediction, pipelining, or any form of speculative execution.
>>> Venray may have created a chip with power consumption an order of
>>> magnitude lower than anything ARM builds and more memory bandwidth than
>>> Intel's highest-end Xeons, but it's an ultra-specialized,
>>> ultra-lightweight core that trades 25 years of flexibility and performance
>>> for scads of memory bandwidth.
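The per-DIMM figures quoted in the article above (eight cores per IC, 16 ICs per DIMM, 1Gbit per IC, 23mW per core) can be checked with a few lines of arithmetic. This is a sketch only: the constant names are made up for illustration, the 23mW value is Venray's claim as reported rather than a measurement, and the power total covers the cores alone, not the DRAM arrays.

```python
# Figures as quoted in the article; names are illustrative, not Venray's.
CORES_PER_IC = 8
ICS_PER_DIMM = 16
DRAM_PER_IC_GBIT = 1
CLAIMED_CORE_POWER_MW = 23   # "as low as 23mW per core at 500MHz" (claimed)

cores_per_dimm = CORES_PER_IC * ICS_PER_DIMM
dimm_capacity_gbytes = ICS_PER_DIMM * DRAM_PER_IC_GBIT / 8   # 8 bits per byte
dimm_core_power_w = cores_per_dimm * CLAIMED_CORE_POWER_MW / 1000

print(f"cores per DIMM:      {cores_per_dimm}")
print(f"DIMM capacity:       {dimm_capacity_gbytes:.0f} GB")
print(f"core power per DIMM: {dimm_core_power_w:.3f} W (claimed best case)")
```

The 128-core and 2GB numbers in the article are consistent with each other. Taken at face value, the claimed per-core figure puts the cores on one DIMM at roughly 3W.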
>>>
>>>
>>> The last few years have seen a dramatic surge in the number of low-power,
>>> many-core architectures being floated as the potential future of
>>> computing, but Venray's approach relies on the manufacturing expertise of
>>> companies who have no experience in building microprocessors and don't
>>> normally serve as foundries. This imposes fundamental restrictions on the
>>> CPU's ability to scale; DRAM is manufactured using a three layer mask
>>> rather than the 10-12 layers Intel and AMD use for their CPUs. Venray
>>> already acknowledges that these conditions imposed substantial limitations
>>> on the original TOMI design.
>>>
>>> Of course, there's still a chance that the TOMI uarch could be effective
>>> in certain bandwidth-hungry scenarios--but that's where the Venray Crazy
>>> Train goes flying off the track.
>>>
>>> The Disingenuous and Crazy
>>>
>>> Let's start here. In a graph like this, you expect the two bars to
>>> represent the same systems being compared across three different
>>> characteristics. That's not the case. When we spoke to Russell Fish in
>>> late November, he pointed us to this publicly available document and
>>> claimed that the results came from a customer with 384 2.1GHz Xeons.
>>> There's no such thing as an S5620 Xeon and even if we grant that he meant
>>> the E5620 CPU, that's a 2.4GHz chip.
>>>
>>> The "Power consumption" graphs show Oracle's maximum power consumption for
>>> a system with 10x Xeon E7-8870s, 168 dedicated SQL processors, 5.3TB (yes,
>>> TB) of Flash and 15x 10,000 RPM hard drives. It's not only a worst-case
>>> figure, it's a figure utterly unrelated to the workload shown in the
>>> Performance comparison. Furthermore, given that each Xeon E7-8870 has a
>>> 130W TDP, ten of them only come out to 1.3kW--Oracle's 17.7kW figure means
>>> that the overwhelming majority of the cabinet's power consumption is
>>> driven by components other than its CPUs.
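The article's power argument above comes down to one division, worked through here. This is a sketch only: the constant names are made up for illustration, and using TDP as a stand-in for actual CPU power draw is the article's simplification, kept here as an assumption.

```python
# Figures as quoted in the article; names are illustrative.
XEON_E7_8870_TDP_W = 130       # per-CPU TDP
NUM_XEONS = 10
CABINET_MAX_POWER_W = 17_700   # Oracle's quoted maximum for the whole system

cpu_power_w = NUM_XEONS * XEON_E7_8870_TDP_W
cpu_share = cpu_power_w / CABINET_MAX_POWER_W
other_power_w = CABINET_MAX_POWER_W - cpu_power_w

print(f"CPUs at TDP:     {cpu_power_w / 1000:.1f} kW")
print(f"CPU share:       {cpu_share:.1%} of the cabinet maximum")
print(f"everything else: {other_power_w / 1000:.1f} kW")
```

On these numbers the ten Xeons account for a bit over 7% of the quoted maximum, so comparing the whole-cabinet figure against a CPU-only design says little about the CPUs themselves.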
>>>
>>> From here, things rapidly get worse. Fish makes his points about power
>>> walls by referring to unverified claims that prototype 90nm Tejas chips
>>> drew 150W at 2.8GHz back in 2004. That's like arguing that Ford can't
>>> build a decent car because the Edsel sucked.
>>>
>>> After reading about the technology, you might think Venray was planning to
>>> market a small chip to high-end HPC niche markets... and you'd be wrong.
>>> The company expects the following to occur as a result of this
>>> revolutionary architecture (organized by least-to-most creepy):
>>>
>>>     Computer speech will be so common that devices will talk to other
>>>     devices in the presence of their users.
>>>
>>>     Your cell phone camera will recognize the face of anyone it sees and
>>>     scan the computer cloud for background red flags as well as six
>>>     degrees of separation.
>>>
>>>     Common commands will be reduced to short verbal cues like clicking
>>>     your tongue or sucking your lips.
>>>
>>>     Your personal history will be displayed for one and all to
>>>     see...women will create search engines to find eligible, prosperous
>>>     men. Men will create search engines to qualify women. Criminals will
>>>     find their jobs much more difficult because their history will be
>>>     immediately known to anyone who encounters them.
>>>
>>>     TOMI Technology will be built on flash memories creating the
>>>     elemental unit of a learning machine... the machines will be able to
>>>     self organize, build robust communicating structures, and collaborate
>>>     to perform tasks.
>>>
>>>     A disposable diaper company will give away TOMI enabled teddy bears
>>>     that teach reading and arithmetic. It will be able to identify
>>>     specific children... and from time to time remind Mom to buy a
>>>     product. The bear will also diagnose a raspy throat, a cough, or
>>>     runny nose.
>>>
>>> Conclusion:
>>>
>>> Fish has spent decades in the microprocessor industry--he invented the
>>> first CPU to use a clock multiplier in conjunction with Chuck H.
>>> Moore--but his vision of the future is crazy enough to scare mad dogs and
>>> Englishmen.
>>>
>>> His idea for a CPU architecture is interesting, even underneath the
>>> obfuscation and false representation, but too practically limited to ever
>>> take off. Google, an enthusiastic and dedicated proponent of energy
>>> efficient, multi-core research said it best in a paper titled "Brawny
>>> cores still beat wimpy cores, most of the time."
>>>
>>> "Once a chip's single-core performance lags by more than a factor to two
>>> or so behind the higher end of current-generation commodity processors,
>>> making a business case for switching to the wimpy system becomes
>>> increasingly difficult... So go forth and multiply your cores, but do it
>>> in moderation, or the sea of wimpy cores will stick to your programmers'
>>> boots like clay."

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
