On Tuesday 22 December 2009 13:09:27 Goswin von Brederlow wrote:
> Jon Harrop <j...@ffconsultancy.com> writes:
> > 1. The array "a" is just an ordinary array of any type of values on the
> > shared heap in F# but, for generality in OCaml, this must be both the
> > underlying ordinary data and a manually-managed shared big array of
> > indices into the ordinary data where the indices get sorted while the
> > original data remain in place until they are permuted at the end.
>
> Unless you have a primitive type that isn't a pointer.

In OCaml, you would need to write a custom quicksort optimized for that 
particular type. In F#, the generic version just works and works efficiently.
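
To make point 1 concrete, the index-based approach looks roughly like this (an 
untested, sequential sketch with made-up names; the real code sorts the index 
Bigarray in parallel from forked children and only the indices live in shared 
memory):

  open Bigarray

  (* Quicksort the shared index array in place, comparing the ordinary
     payload it points into. *)
  let quicksort_indices (data : float array)
      (idx : (int, int_elt, c_layout) Array1.t) =
    let swap i j = let t = idx.{i} in idx.{i} <- idx.{j}; idx.{j} <- t in
    let rec qsort lo hi =
      if lo < hi then begin
        let pivot = data.(idx.{hi}) in
        let i = ref lo in
        for j = lo to hi - 1 do
          (* Every comparison dereferences into the unsorted payload, which
             is where the cache misses in point 2 come from. *)
          if data.(idx.{j}) <= pivot then begin swap !i j; incr i end
        done;
        swap !i hi;
        qsort lo (!i - 1);
        qsort (!i + 1) hi
      end
    in
    qsort 0 (Array1.dim idx - 1)

  let () =
    let data = [| 3.0; 1.0; 2.0; 5.0; 4.0 |] in
    let n = Array.length data in
    let idx = Array1.create int c_layout n in
    for i = 0 to n - 1 do idx.{i} <- i done;
    quicksort_indices data idx;
    (* Only now are the original data permuted into sorted order. *)
    let sorted = Array.init n (fun i -> data.(idx.{i})) in
    Array.iter (Printf.printf "%g ") sorted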

> The advantage with ocaml though is that you never have pointers into a
> structure. Makes things a lot simpler for the GC and avoids large
> overheads in memory.

I don't understand what you mean by OCaml "never has pointers into a 
structure". Half the problem with OCaml is that it almost always uses 
pointers and the programmer has no choice, e.g. for complex numbers.

> > 2. The sorted subarrays are contiguous in memory and, at some
> > subdivision, will fit into L2 cache. So F# offers optimal locality. In
> > contrast, there is no locality whatsoever in the OCaml code and most
> > accesses into the unsorted original array will incur cache misses right
> > up to main memory. So the OCaml approach does not scale as well and will
> > never see superlinear speedup because it cannot be cache friendly.
>
> On the other hand swapping two elements in the array has a constant
> cost no matter what size they have. At some size there will be a break
> even point where copying the data costs more than the cache misses and
> with increasing size the cache won't help F# so much either.

In theory, yes. In practice, that threshold is far larger than the size of any 
value type a real program would use, so it is of no practical concern. Moreover, F# 
gives the programmer control over whether data are unboxed (value types) or 
boxed (reference types) anyway. In contrast, OCaml is tied to a few value 
types that are hard-coded into the GC.
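
For example (illustrative declarations only, and my reading of how the 
runtime represents them):

  (* Unboxed: float array is one of the few layouts special-cased by the
     compiler and GC, so the doubles are stored flat in a single block. *)
  let xs : float array = Array.make 1000 0.0

  (* Boxed: Complex.t is a record of two floats, so the array holds pointers
     to heap-allocated blocks rather than the numbers themselves. *)
  let zs : Complex.t array = Array.make 1000 Complex.zero

  (* The only way to get an unboxed layout for complex numbers is to split
     the structure by hand into two float arrays: *)
  type complex_soa = { re : float array; im : float array }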

> > 3. Child tasks are likely to be executed on the same core as their parent
> > and use a subset of their parent's data in F#, so they offer the best
> > possible locality. In contrast, child processes are likely to be executed
> > on another core in OCaml and offer the worst possible locality.
>
> But, if I understood you right, you first fork one process per
> core.

No, in OCaml I fork every child. That is the only transparent way to give the 
child a coherent view of the heap but it is extremely slow (~1ms):

  "F# can do 60MFLOPS of computation in the time it takes OCaml 
to fork a single process" -
http://caml.inria.fr/pub/ml-archives/caml-list/2009/06/542b8bed77022b4a4824de2da5b7f714.en.html

> Those should then also each pin themselves to one core.

You have no idea which core a forked child process will run on.

> Each process then has a work queue which it works through. So they will 
> always use the local data. Only when a queue runs dry do they steal from
> another process and ruin locality.

You are correctly describing the efficient solution, based upon work-stealing 
queues, that F# uses but OCaml cannot express.
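
For anyone following along, the shape of such a per-worker deque is roughly 
the following (a lock-based toy in OCaml using the threads library, with 
invented names; the real deques behind the TPL are wait-free, and of course 
OCaml's runtime lock means this cannot actually run work in parallel):

  (* One deque per worker: the owner pushes and pops at one end, idle
     workers steal from the other. *)
  type 'a deque = {
    mutex : Mutex.t;
    mutable items : 'a list;   (* head = the owner's end *)
  }

  let create () = { mutex = Mutex.create (); items = [] }

  let push d x =
    Mutex.lock d.mutex;
    d.items <- x :: d.items;
    Mutex.unlock d.mutex

  (* The owner takes the most recently pushed task: it usually touches a
     subset of the data its parent just touched, hence the locality. *)
  let pop d =
    Mutex.lock d.mutex;
    let r = match d.items with
      | [] -> None
      | x :: rest -> d.items <- rest; Some x
    in
    Mutex.unlock d.mutex;
    r

  (* A thief takes the oldest task from the far end: it is the largest
     remaining chunk of work, so steals (and the locality they ruin) stay
     rare. *)
  let steal d =
    Mutex.lock d.mutex;
    let r = match List.rev d.items with
      | [] -> None
      | x :: rest -> d.items <- List.rev rest; Some x
    in
    Mutex.unlock d.mutex;
    r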

> So I don't see where your argument fits. You are not creating children
> on the fly. Only at the start and they run till all the work is done.
> At least in this example.

No, every recursive invocation of the parallel quicksort spawns another child 
on the fly. That's precisely why it parallelizes so efficiently when you have 
wait-free work-stealing task deques and a shared heap. In general, you 
rewrite algorithms into this recursive divide-and-conquer form and 
parallelize when possible. You can parallelize a *lot* of problems 
efficiently that way.
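
The recursive structure I mean looks roughly like this (an untested sketch; 
"invoke" here stands for a hypothetical combinator that spawns its argument 
as a task and returns a function that joins on it, and the threshold constant 
is made up):

  let threshold = 1000  (* below this, spawning costs more than it saves *)

  let rec parallel_qsort invoke (a : 'a array) lo hi =
    if lo < hi then begin
      (* Ordinary in-place partition around a pivot. *)
      let pivot = a.(hi) in
      let i = ref lo in
      for j = lo to hi - 1 do
        if a.(j) <= pivot then begin
          let t = a.(!i) in a.(!i) <- a.(j); a.(j) <- t; incr i
        end
      done;
      let t = a.(!i) in a.(!i) <- a.(hi); a.(hi) <- t;
      let p = !i in
      if hi - lo < threshold then begin
        (* Small subarray: recurse sequentially. *)
        parallel_qsort invoke a lo (p - 1);
        parallel_qsort invoke a (p + 1) hi
      end else begin
        (* Large subarray: sort one half as a freshly spawned child task
           while the parent continues with the other half, then join. *)
        let join = invoke (fun () -> parallel_qsort invoke a lo (p - 1)) in
        parallel_qsort invoke a (p + 1) hi;
        join ()
      end
    end

Plugging in a purely sequential invoke, e.g. (fun f -> f (); fun () -> ()), 
recovers an ordinary quicksort; the interesting cases are when invoke pushes 
the closure onto a task deque (F#) or forks a process (OCaml, see below).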

> > 4. Task deques can handle an arbitrary number of tasks limited only by
> > memory whereas processes are a scarce resource and forking is likely to
> > fail, whereupon the "invoke" combinator will simply execute sequentially.
> > So it is much easier to write reliable and performant code in F# than
> > OCaml.
>
> Why would you fork in invoke?

Fork is currently the only transparent way to implement "invoke" but it is 
extremely slow and unreliable.
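
Concretely, the fork-based version is essentially the following (my 
illustration assuming the Unix library, not the exact code that was 
benchmarked; results have to come back through manually-managed shared 
memory such as a Bigarray because the child only gets a copy-on-write 
snapshot of the heap):

  let invoke (f : unit -> unit) : unit -> unit =
    try
      match Unix.fork () with
      | 0 ->
          (* Child: do the work against the shared memory, then exit. *)
          (try f () with _ -> ());
          exit 0
      | pid ->
          (* Parent: return a join function that waits for the child. *)
          (fun () -> ignore (Unix.waitpid [] pid))
    with Unix.Unix_error _ ->
      (* Fork failed (processes are a scarce resource), so fall back to
         executing sequentially, as point 4 described. *)
      f ();
      (fun () -> ())

Every call pays roughly the ~1ms fork cost quoted above.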

> > 5. OCaml's fork-based "invoke" combinator is many orders of magnitude
> > slower than pushing a closure onto a concurrent task deque in F#.
> >
> > 6. The "threshold" value is a magic number derived from measurements on a
> > given machine in my OCaml code but is dynamically adjusted in a
> > machine-independent way by the "invoke" combinator in my F# code using
> > atomic operations and real time profiling of the proportion of time spent
> > spawning tasks vs doing actual work.
>
> 5+6 seem to be an implementation detail of some specific
> implementation you are talking about.

Yes. I'm talking about today's OCaml and F# implementations.
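
For what it is worth, the self-tuning described in point 6 amounts to 
something like the following toy rendition (invented names, plain refs 
instead of the atomic operations a real implementation needs, and timing via 
the Unix library; it only illustrates the idea of comparing spawn overhead 
against useful work):

  let spawn_time = ref 0.0   (* seconds spent spawning and joining *)
  let work_time  = ref 0.0   (* seconds spent in task bodies *)

  let timed cell f =
    let t0 = Unix.gettimeofday () in
    let r = f () in
    cell := !cell +. (Unix.gettimeofday () -. t0);
    r

  (* Spawn in parallel only while the measured overhead stays below,
     say, 10% of the useful work. *)
  let should_spawn () = !spawn_time < 0.1 *. !work_time

  let adaptive_invoke spawn (f : unit -> unit) : unit -> unit =
    if should_spawn () then
      timed spawn_time (fun () -> spawn (fun () -> timed work_time f))
    else begin
      timed work_time f;
      (fun () -> ())
    end

The only constant left is a ratio rather than a machine-specific element 
count, which is what makes the tuning machine independent.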

> I don't see anything in the theory that would require that.

If by "in theory" you mean that a new performant concurrent GC for OCaml would 
solve these problems then yes. But I doubt OCaml is ever going to get one.

> > The same basic principles apply to many algorithms. Although it can
> > sometimes be tricky to figure out how best to use this technology to
> > parallelize a given algorithm (e.g. successive over relaxation), I have
> > found that a great many algorithms can be parallelized effectively using
> > this approach when you have a suitable foundation in place (like the
> > TPL). Moreover, the ability to use ordinary constructs in F# instead of
> > hacks like type-specific shared memory big arrays in OCaml makes it a lot
> > easier to parallelize programs. My parallel Burrows-Wheeler Transform
> > (BWT), for example, took 30 minutes to develop in F# and 2 days in OCaml.
>
> It might be true that ocaml lacks some primitives for multi-core
> operations. But I would say that is mostly because so far it hasn't
> supported multi-core.

Well, yes. :-)

> If F# has great support for this then that might be a good place to steal
> some ideas from. 

The infrastructure for this kind of shared memory parallel programming is 
really very simple. You just need a GC that can handle a shared heap (which 
HLVM already has) and work-stealing task deques. Then you can easily write 
parallel programs that leverage multicores and run a *lot* faster than 
anything that can be written in OCaml. You can even make this accessible to 
OCaml programmers as a DSL with automatic interop to make high performance 
parallel programming as easy as possible from OCaml.

> But so far I have heard nothing that would make F# fundamentally more capable
> than OCaml could become.

In theory, OCaml could catch up with F#. In practice, the core of OCaml's 
implementation is so heavily optimized in another direction (and that will 
not change because it is OCaml's raison d'ĂȘtre) that it is worth starting 
from scratch. Modern libraries like LLVM make this comparatively painless and 
the improvements are vast. And besides, it is great fun! :-)

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e
