On 13-nov-10, at 22:23, Gary Whatmore wrote:

parallel noob Wrote:

Hello

Intro: people with pseudonyms are often considered trolls here, but this is a really honest question by a sw engineer now writing mostly sequential web applications. (I write "parallel" web apps, but the logic goes that you write a sequential application for each http query, the frontend distributes queries among backend http processes, and the database "magically" ensures proper locking. There are hardly ever any locks in my code.)

D is touted as the next gen of multicore languages. I pondered between D and D.learn, where to ask this. It just strikes me odd that there isn't any kind of documentation explaining how I should write (parallel) code for multicore in D. If D is much different, the general guidelines for PHP web applications, Java, or Erlang might not work. From what I've gathered from these discussions, there are:

- array operations and auto-parallelization of loops
- mmx/sse intrinsics via library
- transactional memory (requires hardware support? doesn't work?)
- "erlang style" concurrency? == process functions in Phobos 2?
- threads, locks, and synchronization primitives

Sean, sybrandy, don, fawzi, tobias, gary, dsimcha, bearophile, russel, trass3r, dennis, and simen clearly have ideas how to work with parallel problems.

A quick look at wikipedia gave http://en.wikipedia.org/wiki/Parallel_computing and http://en.wikipedia.org/wiki/Parallel_programming_model

I fail to map these concepts discussed here with the things listed on those pages. I found MPI, POSIX Threads, TBB, Erlang, OpenMP, and OpenCL there.

Sean mentioned:

"In the long term there may turn out to be better models, but I don't know of one today."

So he's basically saying that those others listed in the wikipedia pages are totally unsuitable for real world tasks? Only Erlang style message passing works?

The next machine I buy comes with 12 or 16 cores or even more -- this one has 4 cores. The typical applications I use take advantage of 1-2 threads. For example a cd ripper starts a new process for each mp3 encoder. The program runs at most 3 threads (the gui, the mp3 encoder, the cd ripper). More and more applications run in the browser. The browser actively uses one thread + one thread per applet. I can't even utilize more than 50% of the power of the current gen!

The situation is different with GPUs. My Radeon 5970 has 3200 cores. When the core count doubles, the FPS rating in games almost doubles. They definitely are not running Erlang style processes (one for GUI, one for sounds, one for physics, one for network). That would leave 3150 cores unused.

There are different kinds of parallel problems: some are trivially, or almost trivially, parallel; others are less so. Some tasks are very quick (one speaks of micro-parallelism), others are much more coarse-grained.

Typical code has limited parallelization potential; the out-of-order execution of modern processors tries to take advantage of it, but having a lot of execution hardware is normally not useful, because the amount of instruction-level parallelism (ILP) is limited. There is an important exception: vector operations. Processors therefore often have vector hardware to execute them efficiently, and compilers vectorize loops to take advantage of it. Array operations are a class of operations (including vector operations) that are often highly parallel. If one wants, for example, to apply a pure operation to each element of an array, that is trivially parallel. Data-parallel languages are especially good at expressing this kind of parallelism.
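To make the "pure operation on an array" case concrete, here is a quick Python sketch (Python used purely for illustration, since this thread has no code): because the operation has no side effects, the elements can be processed on separate cores with no coordination at all.

```python
from concurrent.futures import ProcessPoolExecutor

def square(x):
    """A pure operation: no shared state, so every element is independent."""
    return x * x

def parallel_map(fn, data, workers=4):
    # Each element can be handed to a different core; the runtime
    # only has to collect the results in order.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, data))

if __name__ == "__main__":
    print(parallel_map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

A data-parallel language would express the same thing as a single array expression and vectorize or distribute it for you.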

GPUs are optimized for graphical operations, which are mainly vector and array operations, and thus exhibit a large amount of this kind of parallelism. The same pattern is present in some scientific programs, and indeed GPUs (with CUDA or OpenCL) are increasingly used for them.

The coarser levels of parallelization use other means.
In my opinion, shared-memory parallelization can be done efficiently if one is able to treat independent recursive tasks. Recursive tasks (which come, for example, from divide & conquer approaches, and can among other things be used to perform array operations) can be evaluated efficiently by evaluating subtasks first and stealing supertasks while taking processor locality into account (Cilk takes such an approach).
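A minimal Python sketch of the divide & conquer shape (illustrative only; a real work-stealing runtime like Cilk's parallelizes every level of the recursion tree, while this toy version splits only at the top to keep a fixed thread pool from deadlocking on its own subtasks):

```python
from concurrent.futures import ThreadPoolExecutor

THRESHOLD = 1000  # below this size, recursion overhead beats any parallel gain

def recsum(data, lo, hi):
    # Divide & conquer: each level produces two independent subtasks.
    if hi - lo <= THRESHOLD:
        return sum(data[lo:hi])
    mid = (lo + hi) // 2
    return recsum(data, lo, mid) + recsum(data, mid, hi)

def parallel_recsum(data, workers=4):
    # Split the top of the recursion tree across workers; each worker
    # then runs its subtree sequentially (a work-stealing scheduler
    # would instead steal idle supertasks dynamically).
    n = len(data)
    step = max(1, n // workers)
    chunks = [(i, min(i + step, n)) for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda c: recsum(data, *c), chunks))
```

The key property is that the subtasks share no mutable state, so the scheduler is free to run them anywhere.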

Independent tasks can be represented by threads, should be scheduled fairly, and work well to represent a single interacting object or the different requests of a web server. All OSes provide support for this. Since threads (unlike processes) share memory, one has to take care that changes made by one thread are visible to another in a meaningful way; to achieve this there are locks, atomic ops, and so on. Transactional memory works for changes that should be done atomically (the big problem is that if something fails, one has to undo everything and retry, which becomes more and more likely the longer the transaction runs).
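The classic hazard with shared memory is a lost update on a read-modify-write; a small Python sketch of the lock-based fix (illustrative, not tied to any particular D construct):

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        # Without the lock, "read counter, add 1, write counter" from two
        # threads can interleave and silently lose increments.
        with lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000
```

An atomic increment instruction would achieve the same result without a lock; transactional memory generalizes this to whole groups of writes that commit or roll back together.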

What I tried to achieve with blip.parallel.smp is to accommodate both kinds of parallelism reasonably well. Several higher-level abstractions can then be built on top of this framework. I feel it is important to handle this centrally, because a processor is used optimally when each core has one thread. Actually, to hide the latency of operations that would otherwise make a processor waste cycles, some processors keep two active threads and switch quickly from one to the other when one stalls. This is called hyperthreading, and while it doesn't improve the performance of a single thread (it is actually slightly worse), it improves the throughput and the utilization of the execution units of a single core.

I think that almost all shared-memory parallelization approaches can be implemented reasonably efficiently on top of tasks, possibly using some locality hints for the initial distribution.

PGAS (partitioned global address space) tries to give one a global view while still having local storage: each process can access both its local partition and remote partitions simply by indexing. The conceptual advantage of this is that one can easily migrate to it starting from a local view. The hope is that one can then optimize the layout and the memory-access patterns later, at least partially independently of the algorithm used. In some cases at least, this is possible.
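A toy Python sketch of the PGAS idea (purely illustrative; the class name and layout are made up, not from any real PGAS runtime such as UPC or Co-Array Fortran): the array is globally indexable, but every index deterministically belongs to one "process", and a real runtime would turn remote accesses into communication behind the scenes.

```python
class PartitionedArray:
    """Globally indexable array physically split into per-process partitions."""

    def __init__(self, nprocs, local_size):
        self.local_size = local_size
        # One local storage per process; a real runtime would place each
        # partition in a different node's memory.
        self.partitions = [[0] * local_size for _ in range(nprocs)]

    def owner(self, i):
        # The global index alone determines which process holds element i.
        return i // self.local_size

    def __getitem__(self, i):
        return self.partitions[self.owner(i)][i % self.local_size]

    def __setitem__(self, i, value):
        # A write to a remote partition would trigger communication under
        # the hood, but the source code looks like a plain assignment.
        self.partitions[self.owner(i)][i % self.local_size] = value
```

This is why migrating from a purely local view is easy: the indexing code doesn't change, only the (hidden) cost of each access does.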

In distributed-memory approaches each process has its own memory space and cannot directly access the memory of other processes. This is potentially more complex than PGAS or shared-memory approaches, because one has to explicitly transfer data between processes to communicate; normally one uses *messages* to do it. The big advantage of doing this is that the programmer has to think explicitly about the potentially costly communication operations (a latency of a few ms is still ~10^6 cycles of a typical processor) and can thus optimize them better.
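A minimal Python sketch of explicit message passing between separate address spaces (the function names are mine, chosen for illustration): the worker process shares no memory with its parent, so every piece of data crosses the boundary as an explicit message.

```python
from multiprocessing import Process, Queue

def worker(inbox, outbox):
    # No shared memory with the parent: all input arrives and all
    # output leaves as explicit messages.
    while True:
        msg = inbox.get()
        if msg is None:  # sentinel: no more work
            break
        outbox.put(msg * msg)

def run_squares(values):
    inbox, outbox = Queue(), Queue()
    p = Process(target=worker, args=(inbox, outbox))
    p.start()
    for x in values:
        inbox.put(x)      # each put is a (potentially costly) transfer
    inbox.put(None)
    results = [outbox.get() for _ in range(len(values))]
    p.join()
    return results

if __name__ == "__main__":
    print(run_squares(range(5)))  # [0, 1, 4, 9, 16]
```

Because the transfers are visible in the code, it is obvious where batching messages (sending one big list instead of many small items) would pay off.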

Distributed memory scales very well, as one can easily build large clusters of computers.
MPI (the Message Passing Interface) uses this approach.
Actually, MPI is largely about having collective communication patterns implemented efficiently, and about being able to create optimal subsets of processes for subproblems. MPI is the correct choice if you have a complex problem (complex also in its parallelization) and you want to use *all* available resources efficiently.

Sometimes the problem you have is not so costly that you need to commit all resources to it; you just want to solve it efficiently, taking advantage of parallelization where possible. In this case a good model is the actor model, where objects communicate with each other through messages. One can have thread objects with mailboxes and pattern matching to select messages, or objects with an interface and remote procedure calls to invoke them. You can organize the network of messages in several ways: a central server with connecting clients, a central database used for communication, a peer-to-peer structure, or producer/consumer relationships.
Normally, given a problem, one can see how to partition it optimally.
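A small Python sketch of the actor idea (the class is hypothetical, not any particular library's API): each actor is a thread draining a private mailbox, and the only way to interact with it is to send a message, shown here as a simple producer/consumer pair.

```python
import threading
import queue

class Actor:
    """Minimal actor: one thread serially processing a private mailbox."""

    def __init__(self, handler):
        self.mailbox = queue.Queue()
        self.handler = handler
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def send(self, msg):
        # Messaging is the only interaction; no shared state is exposed.
        self.mailbox.put(msg)

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:  # sentinel: shut down
                break
            self.handler(msg)

    def stop(self):
        self.mailbox.put(None)
        self.thread.join()

# Producer/consumer relationship: one actor forwards results to another.
results = []
sink = Actor(results.append)
doubler = Actor(lambda x: sink.send(2 * x))
for i in range(3):
    doubler.send(i)
doubler.stop()  # drains doubler's mailbox before returning
sink.stop()
print(results)  # [0, 2, 4]
```

Because each mailbox is processed serially, no locks are needed inside a handler, which is the main appeal of the model.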

I use blip.parallel.rpc to provide that kind of messaging between objects.
Note that in this model one has to think about partial failure: a failure of one process should not necessarily stop all processes (in some cases it might even go undetected). At this level one could theoretically migrate processes/objects automatically, but given that the latency increase can be very large (~10^6), such automatic distribution is doable only for tasks the programmer anticipated; fully automatic redistribution of arbitrary objects is not realistic.

I hope this overview helps a bit.
Fawzi
