On 23/08/2014, at 3:12 PM, srean wrote:
> 
> But vals are nice. In any case my point was about the fact that the full
> syntax for assignment / initialization for val/var is not shown in much
> detail early on.

http://felix-lang.org/share/src/web/tut/tut_06.fdoc

shows the basic syntax.
 
> > I have seen people go
> >
> > "Yay! huge matrix, huge vector, so parallel, much fast, so speed"
> >
> > to actually coding it up with threads etc and then get bitten by the 
> > reality that it is slower than a dumb serial implementation,
> 
> Of course this can happen.
>  
> It's not so much that it _can_ happen, it almost always happens, unless one
> has carefully thought about the order in which the data is touched.

This is probably mainly the case on dual cores.
Unfortunately that's all I have. If I had a 12-core box
better could be done, but I can't even measure it.

>  
> > However Intel CPUs have a separate cache for each core.
> > And separate FPUs.
> 
> Nope, some of their quad cores have caches shared between a pair, not their 
> L1 though.

I mean the on-chip primary cache. Note we're talking about
physical cores (not the virtual CPU things Intel tries to make work).

> The killer is that the bus is shared.

Look, I got laughed off the HPC list (Atlas) when I said caching
is the ENEMY of performance.

So I'll say it again: CPUs should have NO caches,
other than a couple of registers.

Synchronising caches costs too much.
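
You can see the cost directly. Here's a quick C++ sketch (not Felix,
and every name in it is mine -- a toy, assuming 64-byte cache lines and
a C++11 compiler with -pthread): two threads bump two counters. When
the counters share a cache line the coherence protocol bounces the
line between the cores on every write; pad them a line apart and the
very same loop runs several times faster. That gap is pure cache
synchronisation.

// false_sharing.cpp -- hypothetical demo, names made up.
// build: g++ -O2 -pthread false_sharing.cpp
// Two threads increment two counters 100M times each. In Shared the
// counters sit on one cache line; in Padded on separate lines.
#include <chrono>
#include <cstdio>
#include <thread>

struct Shared { long a; long b; };                         // same line
struct Padded { alignas(64) long a; alignas(64) long b; }; // separate lines

template <class T> static double run() {
    T c{};
    // volatile stops the compiler collapsing the loop to one store
    auto work = [](volatile long *p) {
        for (long i = 0; i < 100000000L; ++i) *p = *p + 1;
    };
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1(work, &c.a), t2(work, &c.b);
    t1.join(); t2.join();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return dt.count();
}

int main() {
    std::printf("same cache line: %.2fs\n", run<Shared>());
    std::printf("padded apart:    %.2fs\n", run<Padded>());
}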

There should be caches alright -- in the memory chips.

The reason is obvious: many parallel CPUs accessing
the memory find their data cached, without any synchronisation
issues.
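
To make that concrete, here's a toy C++ model (hypothetical, my names,
nothing real): one direct-mapped cache sits with the RAM and serves
every CPU through a single port. There is exactly one copy of each
line, so there is nothing to keep coherent; the cost moves to
contention for the port, which is a routing problem, not a coherence
one.

// memside_cache.cpp -- hypothetical model. The cache lives with the
// memory, not with the CPUs: one copy of every line, no coherence
// protocol. CPUs contend for the single port instead.
#include <array>
#include <cstdint>
#include <cstdio>
#include <mutex>

struct MemSideCache {
    static constexpr int LINES = 64, LINE = 64;
    std::array<uint8_t, 1 << 20> ram{};          // backing DRAM
    struct Line { uint32_t tag = ~0u; std::array<uint8_t, LINE> data{}; };
    std::array<Line, LINES> lines{};             // direct-mapped, in the RAM chip
    std::mutex port;                             // the shared data path

    uint8_t read(uint32_t addr) {
        std::lock_guard<std::mutex> lk(port);    // one access at a time
        uint32_t tag = addr / LINE, idx = tag % LINES;
        Line &l = lines[idx];
        if (l.tag != tag) {                      // miss: fill from DRAM
            for (int i = 0; i < LINE; ++i) l.data[i] = ram[tag * LINE + i];
            l.tag = tag;
        }
        return l.data[addr % LINE];              // no CPU-side copy exists
        // (write path omitted: it would just update the one copy here)
    }
};

int main() {
    static MemSideCache m;                       // static: the array is 1MB
    m.ram[1000] = 42;
    std::printf("read: %u\n", unsigned(m.read(1000)));
}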

Of course this is how the REAL parallel machines actually work,
namely GPUs.

The primary problem in getting parallel performance is the data and
address path architecture. In other words, ROUTING.

Notice I didn't say "buss", because a "buss" was a new idea,
invented by DEC I think. Older machines did not have busses.
Neither do new ones (they all have multiple data/address
paths: at least to the GPU, to local fast memory, and to off-board
slower shared RAM -- three paths at least).

** Just to explain: a buss is a parallel data path
which goes to ALL devices. It works by allowing only
one device at a time to drive the buss: typically one writer
and one reader. All the other devices are tri-stated
(effectively disconnected from the buss).

Older machines had two busses: one for the address
and one for the data. Every device decoded the address
and only responded if the address was its own.
In reality some high bits were ignored, and the next
lot decoded by a separate decoder chip, because the
memory chips themselves only decode a few bits for selection.
Internally the lower bits decide what is fetched and
put on the data path.
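
A toy C++ model of that decode chain (hypothetical, my names; 3 high
address bits go to a 74138-style 3-to-8 decoder selecting one of eight
8K x 8 chips, and the 13 low bits are decoded inside the chip):

// decode.cpp -- hypothetical model of two-buss address decoding.
#include <array>
#include <cstdint>
#include <cstdio>

struct MemChip {
    std::array<uint8_t, 1 << 13> cells{};        // 8K x 8 device
    // the chip never sees the high bits: it masks to 13 bits
    uint8_t read(uint16_t a) const { return cells[a & 0x1FFF]; }
    void write(uint16_t a, uint8_t v) { cells[a & 0x1FFF] = v; }
};

struct Buss {
    std::array<MemChip, 8> chips;                // one per decoder output
    uint8_t read(uint16_t addr) {
        unsigned sel = addr >> 13;               // decoder chip: high bits
        return chips[sel].read(addr);            // selected chip: low bits
    }
    void write(uint16_t addr, uint8_t v) { chips[addr >> 13].write(addr, v); }
};

int main() {
    Buss b;
    b.write(0x6001, 0xAB);                       // chip 3, cell 1
    std::printf("0x6001 -> 0x%02X\n", unsigned(b.read(0x6001)));
}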

DEC invented the Q-bus (the Unibus successor) which saved a lot of
pins: they put both the address and data information on the same
wires, time-multiplexed. That reduced 32 pins to 16, plus one strobe
line, at the expense of an external address latch.
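
Modelled the same way (again a hypothetical C++ toy, my names): the
shared 16 wires carry the address first, the external latch captures
it on the strobe, then the same wires carry the data.

// muxbus.cpp -- hypothetical model of a multiplexed address/data buss.
#include <cstdint>
#include <cstdio>

struct MuxBuss {
    uint16_t wires = 0;              // the 16 shared address/data lines
    uint16_t addr_latch = 0;         // the external latch the saved pins cost
    uint8_t memory[1 << 16] = {};

    void address_phase(uint16_t addr) {
        wires = addr;                // master drives the address onto the wires
        addr_latch = wires;          // strobe clocks it into the latch
    }
    void write_phase(uint8_t data) {
        wires = data;                // the same wires now carry data
        memory[addr_latch] = uint8_t(wires);
    }
    uint8_t read_phase() {
        wires = memory[addr_latch];  // slave drives the wires back
        return uint8_t(wires);
    }
};

int main() {
    MuxBuss b;
    b.address_phase(0x1234);         // cycle 1: address
    b.write_phase(0x5A);             // cycle 2: data
    b.address_phase(0x1234);
    std::printf("read back 0x%02X\n", unsigned(b.read_phase()));
}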

--
john skaller
skal...@users.sourceforge.net
http://felix-lang.org



