On 13-nov-10, at 22:23, Gary Whatmore wrote:

parallel noob Wrote:

Hello

Intro: people with pseudonyms are often considered trolls here, but this is a really honest question by a sw engineer now writing mostly sequential web applications. (I write "parallel" web apps, but the logic goes that you write a sequential application for each http query, the frontend distributes queries among backend http processes, and the database "magically" ensures proper locking. There are hardly ever any locks in my code.)

D is touted as the next gen of multicore languages. I pondered between D and D.learn, where to ask this. It just strikes me odd that there isn't any kind of documentation explaining how I should write (parallel) code for multicore in D. If D is much different, the general guidelines for PHP web applications, Java, or Erlang might not work. From what I've gathered from these discussions, there are:

- array operations and auto-parallelization of loops
- mmx/sse intrinsics via library
- transactional memory (requires hardware support? doesn't work?)
- "erlang style" concurrency? == process functions in Phobos 2?
- threads, locks, and synchronization primitives

Sean, sybrandy, don, fawzi, tobias, gary, dsimcha, bearophile, russel, trass3r, dennis, and simen clearly have ideas how to work with parallel problems.

A quick look at wikipedia gave http://en.wikipedia.org/wiki/Parallel_computing and http://en.wikipedia.org/wiki/Parallel_programming_model

I fail to map these concepts discussed here with the things listed on those pages. I found MPI, POSIX Threads, TBB, Erlang, OpenMP, and OpenCL there.

Sean mentioned:

"In the long term there may turn out to be better models, but I don't know of one today."

So he's basically saying that those others listed in the wikipedia pages are totally unsuitable for real world tasks? Only Erlang style message passing works?

The next machine I buy comes with 12 or 16 cores or even more -- this one has 4 cores. The typical applications I use take advantage of 1-2 threads. For example a cd ripper starts a new process for each mp3 encoder. The program runs at most 3 threads (the gui, the mp3 encoder, the cd ripper). More and more applications run in the browser. The browser actively uses one thread + one thread per applet. I can't even utilize more than 50% of the power of the current gen!

The situation is different with GPUs. My Radeon 5970 has 3200 cores. When the core count doubles, the FPS rating in games almost doubles. They definitely are not running Erlang style processes (one for GUI, one for sounds, one for physics, one for network). That would leave 3150 cores unused.

There are different kinds of parallel problems: some are trivially, or almost trivially, parallel; others are less so. Some tasks are very quick (one speaks of micro-parallelism), others are much more coarse-grained.

Typical code has limited parallelization potential; the out-of-order execution of modern processors tries to take advantage of it, but having a lot of execution hardware is normally not useful, because the amount of instruction-level parallelism (ILP) is limited. There is an important exception: vector operations. Processors therefore often have vector hardware to execute them efficiently, and compilers vectorize loops to take advantage of it. Array operations are a class of operations (including vector operations) that are often highly parallel. If one wants, for example, to apply a pure operation to each element of an array, that is trivially parallel. Data-parallel languages are especially good at expressing this kind of parallelism.
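To make the "pure operation on an array" case concrete, here is a quick Python sketch (Python used purely for illustration, since this thread has no code): because the operation has no side effects, the elements can be processed on separate cores with no coordination at all.

```python
from concurrent.futures import ProcessPoolExecutor

def square(x):
    """A pure operation: no shared state, so every element is independent."""
    return x * x

def parallel_map(fn, data, workers=4):
    # Each element can be handed to a different core; the runtime
    # only has to collect the results in order.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, data))

if __name__ == "__main__":
    print(parallel_map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

A data-parallel language would express the same thing as a single array expression and vectorize or distribute it for you.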

GPUs are optimized for graphical operations, which are mainly vector and array operations, and thus exhibit a large amount of this kind of parallelism. The same pattern is present in some scientific programs, and indeed GPUs (with CUDA or OpenCL) are increasingly used for them.

The coarser levels of parallelization use other means.
In my opinion, shared-memory parallelization can be done efficiently if one is able to treat independent recursive tasks. Recursive tasks (which come, for example, from divide & conquer approaches, and can among other things be used to perform array operations) can be evaluated efficiently by evaluating subtasks first and stealing supertasks while taking processor locality into account (Cilk takes such an approach).
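A minimal Python sketch of the divide & conquer shape (illustrative only; a real work-stealing runtime like Cilk's parallelizes every level of the recursion tree, while this toy version splits only at the top to keep a fixed thread pool from deadlocking on its own subtasks):

```python
from concurrent.futures import ThreadPoolExecutor

THRESHOLD = 1000  # below this size, recursion overhead beats any parallel gain

def recsum(data, lo, hi):
    # Divide & conquer: each level produces two independent subtasks.
    if hi - lo <= THRESHOLD:
        return sum(data[lo:hi])
    mid = (lo + hi) // 2
    return recsum(data, lo, mid) + recsum(data, mid, hi)

def parallel_recsum(data, workers=4):
    # Split the top of the recursion tree across workers; each worker
    # then runs its subtree sequentially (a work-stealing scheduler
    # would instead steal idle supertasks dynamically).
    n = len(data)
    step = max(1, n // workers)
    chunks = [(i, min(i + step, n)) for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda c: recsum(data, *c), chunks))
```

The key property is that the subtasks share no mutable state, so the scheduler is free to run them anywhere.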

Independent tasks can be represented by threads, should be scheduled fairly, and work well to represent a single interacting object or the different requests of a web server. All OSes provide support for this. Since threads (unlike processes) share memory, one has to take care that changes made by one thread are visible to another in a meaningful way; to achieve this there are locks, atomic ops, and so on. Transactional memory works for changes that should be done atomically (the big problem is that if something fails, one has to undo everything and retry, which becomes more and more likely the longer the transaction runs).
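The classic hazard with shared memory is a lost update on a read-modify-write; a small Python sketch of the lock-based fix (illustrative, not tied to any particular D construct):

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        # Without the lock, "read counter, add 1, write counter" from two
        # threads can interleave and silently lose increments.
        with lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000
```

An atomic increment instruction would achieve the same result without a lock; transactional memory generalizes this to whole groups of writes that commit or roll back together.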

What I tried to achieve with blip.parallel.smp is to accommodate both kinds of parallelism reasonably well. Several higher-level abstractions can then be built on top of this framework. I feel it is important to handle this centrally, because a processor is used optimally when each core has one thread. Actually, to hide the latency of operations that would otherwise make a processor waste cycles, some processors keep two active threads and switch quickly from one to the other when one stalls. This is called hyperthreading, and while it doesn't improve the performance of a single thread (it is actually slightly worse), it improves the throughput and the utilization of the execution units of a single core.

I think that almost all shared-memory parallelization approaches can be implemented reasonably efficiently on top of tasks, possibly using some locality hints for the initial distribution.

PGAS (partitioned global address space) tries to give one a global view while still having local storage: each process can access both its local partition and remote partitions simply by indexing. The conceptual advantage of this is that one can easily migrate to it starting from a local view. The hope is that one can then optimize the layout and the memory-access patterns later, at least partially independently of the algorithm used. In some cases at least, this is possible.
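A toy Python sketch of the PGAS idea (purely illustrative; the class name and layout are made up, not from any real PGAS runtime such as UPC or Co-Array Fortran): the array is globally indexable, but every index deterministically belongs to one "process", and a real runtime would turn remote accesses into communication behind the scenes.

```python
class PartitionedArray:
    """Globally indexable array physically split into per-process partitions."""

    def __init__(self, nprocs, local_size):
        self.local_size = local_size
        # One local storage per process; a real runtime would place each
        # partition in a different node's memory.
        self.partitions = [[0] * local_size for _ in range(nprocs)]

    def owner(self, i):
        # The global index alone determines which process holds element i.
        return i // self.local_size

    def __getitem__(self, i):
        return self.partitions[self.owner(i)][i % self.local_size]

    def __setitem__(self, i, value):
        # A write to a remote partition would trigger communication under
        # the hood, but the source code looks like a plain assignment.
        self.partitions[self.owner(i)][i % self.local_size] = value
```

This is why migrating from a purely local view is easy: the indexing code doesn't change, only the (hidden) cost of each access does.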

In distributed-memory approaches each process has its own memory space and cannot directly access the memory of other processes. This is potentially more complex than PGAS or shared-memory approaches, because one has to explicitly transfer data between processes to communicate; normally one uses *messages* to do it. The big advantage of doing this is that the programmer has to think explicitly about the potentially costly communication operations (a latency of a few ms is still ~10^6 cycles of a typical processor) and can thus optimize them better.
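A minimal Python sketch of explicit message passing between separate address spaces (the function names are mine, chosen for illustration): the worker process shares no memory with its parent, so every piece of data crosses the boundary as an explicit message.

```python
from multiprocessing import Process, Queue

def worker(inbox, outbox):
    # No shared memory with the parent: all input arrives and all
    # output leaves as explicit messages.
    while True:
        msg = inbox.get()
        if msg is None:  # sentinel: no more work
            break
        outbox.put(msg * msg)

def run_squares(values):
    inbox, outbox = Queue(), Queue()
    p = Process(target=worker, args=(inbox, outbox))
    p.start()
    for x in values:
        inbox.put(x)      # each put is a (potentially costly) transfer
    inbox.put(None)
    results = [outbox.get() for _ in range(len(values))]
    p.join()
    return results

if __name__ == "__main__":
    print(run_squares(range(5)))  # [0, 1, 4, 9, 16]
```

Because the transfers are visible in the code, it is obvious where batching messages (sending one big list instead of many small items) would pay off.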

Distributed memory scales very well, as one can easily build large clusters of computers.
MPI (the Message Passing Interface) uses this approach.
Actually, MPI is largely about having collective communication patterns implemented efficiently, and about being able to create optimal subsets of processes for subproblems. MPI is the correct choice if you have a complex problem (complex also in its parallelization) and you want to use *all* available resources efficiently.

Sometimes the problem you have is not so costly that you need to commit all resources to it; you just want to solve it efficiently, taking advantage of parallelization where possible. In this case a good model is the actor model, where objects communicate with each other through messages. One can have thread objects with mailboxes and pattern matching to select messages, or objects with an interface and remote procedure calls to invoke them. You can organize the network of messages in several ways: a central server with connecting clients, a central database used for communication, a peer-to-peer structure, or producer/consumer relationships.
Normally, given a problem, one can see how to partition it optimally.
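A small Python sketch of the actor idea (the class is hypothetical, not any particular library's API): each actor is a thread draining a private mailbox, and the only way to interact with it is to send a message, shown here as a simple producer/consumer pair.

```python
import threading
import queue

class Actor:
    """Minimal actor: one thread serially processing a private mailbox."""

    def __init__(self, handler):
        self.mailbox = queue.Queue()
        self.handler = handler
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def send(self, msg):
        # Messaging is the only interaction; no shared state is exposed.
        self.mailbox.put(msg)

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:  # sentinel: shut down
                break
            self.handler(msg)

    def stop(self):
        self.mailbox.put(None)
        self.thread.join()

# Producer/consumer relationship: one actor forwards results to another.
results = []
sink = Actor(results.append)
doubler = Actor(lambda x: sink.send(2 * x))
for i in range(3):
    doubler.send(i)
doubler.stop()  # drains doubler's mailbox before returning
sink.stop()
print(results)  # [0, 2, 4]
```

Because each mailbox is processed serially, no locks are needed inside a handler, which is the main appeal of the model.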

I use blip.parallel.rpc to provide that kind of messaging between objects.
Note that in this model one has to think about partial failure: a failure of one process should not necessarily stop all processes (in some cases it might even go undetected). At this level one could theoretically migrate processes/objects automatically, but given that the latency increase can be very large (~10^6), such automatic distribution is doable only for tasks the programmer anticipated; fully automatic redistribution of arbitrary objects is not realistic.

I hope this overview helps a bit.
Fawzi
