On Sunday, December 13, 2015 at 9:51:31 AM UTC, Tamas Papp wrote:
> AFAIK typical SBC CPUs are not heavily optimized for floating point;
> there is an order of magnitude difference compared to an x86. I don't
> understand how a cluster would make economic sense, even for tasks that
> parallelize well (and then there is the network overhead).
>
I'm sure you are right about ARM's floating point not being as good as
x86's, especially for ARMv6, but is this changing with ARMv7 and the
64/32-bit ARMv8-A?
Even Intel's latest Xeon Phi uses Atom cores (the same as in smartphones,
or possibly modified, I haven't read too closely yet), just more of them
AND yes, lots of cache. That, and the better memory hierarchy in general,
helps with floating point/HPC. It helps both FPU and integer work, though,
and ARM isn't slow anymore, for integer at least.
I just saw the other day that on some benchmark from 3 months ago (I
forget which, and I'm not sure it uses the FPU), the A9[X] in the iPhone
6s/iPad slightly beats a MacBook from this year with a 1.2 GHz non-Atom
Core M CPU ("laptop"; wouldn't you call that kind "mobile" too?).
Anyway, for the sake of argument, let's say the FPU is slow[er]. These
boards also commonly have GPUs (are you ignoring those?), and I see the
latest Adreno in Qualcomm chips has unified "[virtual] memory" with the
CPU (not sure why they put in "virtual"). At least some of these GPUs are
good at GPGPU (Adreno says the latest is 40% faster), such as Nvidia's.
Sadly, the Mali-400 MP2 in the PINE64 (or at least Utgard, the
microarchitecture it's based on) is listed under "Graphics", not
"Graphics & Compute", in the table at:
https://en.wikipedia.org/wiki/Mali_(GPU)
It also doesn't have fused multiply-add (while some Malis do), though
"Some Malis support cache coherency
<https://en.wikipedia.org/wiki/Cache_coherence> for the L2 cache with the
CPU".
GPUs commonly use single precision (double isn't really there in consumer
GPUs, and maybe not at all in mobile?), but it might do, with tricks, or
not..
[About clusters: Look into unum (universal number) arithmetic, which aims
to replace traditional floating point, not only for speed/energy (while
arguably needing better hardware), but also for more correct answers..
AND for turning "embarrassingly serial" code into data-parallel code
("the easiest kind").. I may write a post here in a separate thread; we
shouldn't hijack this one. Maybe I'll add my thoughts to the Flexnum one?]
> AFAIK Julia is available in Raspbian. May not be the most recent
> version though, but 3.2 looks like it is there:
> http://archive.raspbian.org/raspbian/pool/main/j/julia/
Doesn't this need to be fixed? Isn't 0.4.x even more important on these
slower computers? Or does the overhead of compiling go up? Is a fast
compile/less optimization/"almost interpreted" mode needed, or already
available? Or even useful, if it's counter-productive at runtime..? Could
there be a way to optimize different functions differently? Is that
already done, say with inlined (leaf) functions optimized more (and
others compiled for code-space density and fast compilation)?
P.S. Was Blackberry Pi 2 just a typo or did I miss something?
> Best,
>
> Tamas
>
> On Sun, Dec 13 2015, cdm <[email protected]> wrote:
>
> > while this SBC would represent a substantial improvement
> > over the Pi systems currently in market, i suspect that the
> > most notable aspect is the price ...
> >
> > generally, as the price points come down, clusters become
> > much more feasible ...
> >
> > parallelism is next.
>