On Wednesday, 16 September 2015 at 19:21:59 UTC, deadalnix wrote:
> No, you don't, because the streamer still needs to load the unums one by one. Maybe two by two, with a fair amount of hardware speculation (which means you are already trading energy for performance, so the energy argument is weak). There is no way you can feed 256+ cores that way.

You can continuously load 64 bytes in a stream, decode them to your internal format and push them into the scratchpads of other cores. You could even do this in hardware.
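
Something like this, as a rough D sketch. The fixed 4-byte records, the field layout and the round-robin fan-out are all made-up assumptions for illustration; a real unum decoder would read variable-width, self-describing fields:

// Hypothetical sketch: stream 64-byte blocks of packed values, decode
// each to a fixed-width internal form, and round-robin the results to
// per-core work queues. The Internal struct and the 4-byte records are
// invented; a real unum decoder reads variable-width fields.
import std.stdio;

struct Internal { bool sign; int exponent; ulong fraction; bool ubit; }

Internal decode(ref const(ubyte)[] stream)
{
    // Placeholder: a real decoder would read the es/fs size fields
    // first, then pull a variable number of bits. We just consume 4 bytes.
    Internal r;
    r.sign = (stream[0] & 0x80) != 0;
    r.exponent = stream[0] & 0x7f;
    r.fraction = (cast(ulong) stream[1] << 16) | (stream[2] << 8) | stream[3];
    r.ubit = false;
    stream = stream[4 .. $];
    return r;
}

void main()
{
    auto block = new ubyte[](64);        // one 64-byte streamed block
    Internal[][4] queues;                // scratchpad queues for 4 cores
    const(ubyte)[] cursor = block;
    size_t core = 0;
    while (cursor.length >= 4)
    {
        queues[core] ~= decode(cursor);  // fan decoded values out
        core = (core + 1) % queues.length;
    }
    writefln("queued %s values per core", queues[0].length);
}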

If you look at the ubox brute-forcing method, you run many calculations over the same data, because you solve spatially, not by timesteps. So you can run many, many parallel computations over the same data.
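
A toy illustration of that "same data, many workers" shape; f, the interval and the tolerance are invented here, and this is plain floating point, not the actual ubox/unum machinery:

// Illustrative sketch only: the spatial approach tests many candidate
// boxes against the same equation, so each box is an independent task
// over shared data.
import std.algorithm : count;
import std.math : abs;
import std.parallelism : parallel;
import std.range : iota;
import std.stdio;

double f(double x) { return x * x - 2.0; } // solve f(x) = 0 on [0, 2]

void main()
{
    enum n = 1 << 16;                  // many small boxes, one spatial pass
    auto keep = new bool[](n);
    foreach (i; parallel(iota(n)))     // same data, many parallel workers
    {
        immutable lo = 2.0 * i / n, hi = 2.0 * (i + 1) / n;
        // Keep the box if f changes sign (or is already tiny) inside it.
        keep[i] = f(lo) * f(hi) <= 0 || abs(f(lo)) < 1e-6;
    }
    writefln("%s of %s boxes survive", keep.count(true), n);
}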

> To give you a similar example, x86 decoding is often the bottleneck on an x86 CPU. The number of ALUs in x86 has decreased rather than increased over the past decade, because you simply can't decode fast enough to feed them. Yet x86 CPUs have 64-way speculative decoding as a first stage.

That's because we use dumb compilers that don't prefetch intelligently. If you are writing for a tile-based VLIW CPU, you preload. These calculations are highly iterative, so I'd rather think of it as a co-processor solving a single equation repeatedly than as something running the whole program. You can run the larger program on a regular CPU or a few of the cores.
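
The schedule I have in mind is plain double buffering. A minimal sketch, where loadTile stands in for whatever DMA/preload mechanism the hardware would expose (here it is just a copy, so the overlap is only in the schedule, not actual concurrency):

// Double-buffering sketch: while the core crunches one tile, the next
// tile is already being fetched, so the loader never stalls the ALUs.
// loadTile and compute are stand-ins, not a real runtime API.
import std.stdio;

enum tileSize = 256;

void loadTile(const(double)[] src, double[] dst) { dst[] = src[]; }

double compute(const(double)[] tile)
{
    double acc = 0;
    foreach (v; tile) acc += v * v;      // stand-in inner kernel
    return acc;
}

void main()
{
    auto data = new double[](tileSize * 8);  // 8 tiles of input
    foreach (i, ref v; data) v = i;

    double[][2] buf = [new double[](tileSize), new double[](tileSize)];
    loadTile(data[0 .. tileSize], buf[0]);   // prime the pipeline

    double total = 0;
    foreach (t; 0 .. 8)
    {
        if (t + 1 < 8)                       // preload the next tile
            loadTile(data[(t + 1) * tileSize .. (t + 2) * tileSize],
                     buf[(t + 1) & 1]);      // ...while this one computes
        total += compute(buf[t & 1]);
    }
    writefln("total = %s", total);
}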

> The problem is not transistors, it is wires. Because the damn thing is variable-width in every way, pretty much every input bit can end up anywhere in the functional unit. That is a LOT of wire.

I haven't seen a design, so I cannot comment. But keep in mind that the CPU does not have to work with the storage format directly; it can use a different format internally.
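
For instance, unpack once at the boundary into fixed-width fields that the functional units handle directly. The 16-bit layout below is one I made up for the example, not the real self-describing unum encoding:

// Sketch of the "different format internally" idea: variable-width in
// storage, widened to fixed fields once on-chip.
import std.stdio;

struct Unpacked
{
    bool  sign;
    bool  ubit;     // inexact flag: value lies in an open interval
    int   exponent; // widened to a fixed size
    ulong fraction; // left-justified in a fixed 64-bit field
}

// Unpack a toy 16-bit format: 1 sign, 4 exponent, 10 fraction, 1 ubit.
Unpacked unpack(ushort u)
{
    Unpacked r;
    r.sign     = ((u >> 15) & 1) != 0;
    r.exponent = (u >> 11) & 0xF;
    r.fraction = cast(ulong)((u >> 1) & 0x3FF) << 54; // left-justify
    r.ubit     = (u & 1) != 0;
    return r;
}

void main()
{
    auto v = unpack(0b1_0110_1010101010_1);
    writefln("sign=%s exp=%s frac=%016x ubit=%s",
             v.sign, v.exponent, v.fraction, v.ubit);
}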

We'll probably see FPGA implementations that can be run on FPGA cards for PCs within a few years. I read somewhere that a group in Singapore was working on it.
