On Fri, Dec 5, 2014 at 11:33 AM, <[email protected]> wrote:
> On Fri, 05 Dec 2014 11:18:57 -0800, Dave Taht said:
>> The Mill is an extremely wide-issue VLIW design, able to issue 30+
>> MIMD operations per cycle. The Mill is inherently a vector machine
>> and can vectorize and pipeline almost all loops in general purpose
>> code.
>
> The big question is whether we know more about writing compilers for VLIW
> machines than we did when the Itanium came out. That was hard enough to
> get just 3 instructions packed per word (of course, the fact that it wasn't
> 3 generic instructions, but 2 of one flavor and 1 of another, didn't help).

Well, in this case half the instructions are one flavor and the other
half another. But it's the belt concept in the "Mill" that is key.

Basically, having tons and tons of fixed addressable registers doesn't
work well (as in the Itanium, SPARC, and other arches) for a variety of
reasons...

Taking a classic smaller register set, such as in the x86_64, and trying
to add all these superscalar and out-of-order features to it has hit a
brick wall...

and the best we see in ARM and MIPS (with way more registers) is
typically two out-of-order ops, total.

Stack machines overly serialize operations and tend to bottleneck on
local cache (see the Transputer T800 for the last decent example).

Aside from a bunch of genuinely weirder architectures (see for example
the Propeller, or Dave May's XCore stuff, or Parallella), the Mill's
"belt" idea - temporal register addressing - is the first new idea I've
seen in CPU design for a very, very long time. (Perhaps it was tried in
some other architecture?)

Even if the Mill can't get to 32 ops/cycle generally (and some of those
ops are overhead in maintaining the belt, but not as much as you might
think), I do think it can get to quite a few, even in branchy code, and
the lower end versions of the arch are comparable in ops/cycle to the
best we can do today with computers running at much faster basic clock
rates.

And the context switch/subroutine call overhead! 4 cycles. Wow. :)

I certainly have quibbles with the presos I've read so far: edge cases
like floating point ops, and other seemingly nice-to-have but not
critical to the core architecture feature(s)... but I long for an FPGA
version, at least, to play with.

I've spent a lot of time trying to come up with a microarchitecture that
could do fq_codel at 10GigE+ speeds (prototyping in the Parallella's
FPGA), and kept dreaming of something like the "Propeller" at a really
high clock rate...

... then I stumbled over this. Sure, it's years out, but, like wow.
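
For anyone who hasn't met "temporal register addressing" before, here is
a toy Python sketch of the idea. It is only an illustration, not the
Mill's actual design: the `Belt` class, its method names, and the belt
length of 4 are all made up for this example. The one thing it tries to
show is that operands are named by how recently they were produced
(position 0 is the newest result) rather than by a fixed register
number, and that the oldest value simply falls off the end.

```python
from collections import deque


class Belt:
    """Toy model of a belt: a fixed-length queue of recent results.

    Hypothetical illustration only; the names and the size are assumptions.
    """

    def __init__(self, length=4):
        # A deque with maxlen drops the entry at the far end when full.
        self._slots = deque(maxlen=length)

    def push(self, value):
        # A new result always lands at position 0; older values age by one.
        self._slots.appendleft(value)
        return value

    def __getitem__(self, pos):
        # Operands are addressed by age: 0 is newest, 1 the one before it.
        return self._slots[pos]

    def add(self, a, b):
        # Read both operands by position, then drop the sum on the belt.
        return self.push(self[a] + self[b])

    def mul(self, a, b):
        return self.push(self[a] * self[b])

    def snapshot(self):
        # Newest-first copy of the belt contents.
        return list(self._slots)


def demo():
    belt = Belt(length=4)
    belt.push(3)    # belt: [3]
    belt.push(5)    # belt: [5, 3]
    belt.add(0, 1)  # 5 + 3 = 8   -> belt: [8, 5, 3]
    belt.mul(0, 1)  # 8 * 5 = 40  -> belt: [40, 8, 5, 3]
    belt.push(7)    # belt is full, so the oldest value (3) falls off
    return belt.snapshot()
```

Note that no operation names a destination: results always go to the
front, which is why a value that is needed for longer has to be saved
somewhere else before it falls off the end.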

Well worth an initial hour to read about, watch, and think about.

-- 
Dave Täht

http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks
_______________________________________________
Bloat mailing list
[email protected]
https://lists.bufferbloat.net/listinfo/bloat
