Hi Greg --

I'm Greg Titus and I work on the Chapel runtime, with an emphasis on tasking, low-level memory management, and inter-locale communication. Some of what I've written here will likely already be familiar to you, but I've tried to err on the side of giving more info in the hope that it will be useful to others.


On 5/27/2015 6:53 AM, Greg Kreider wrote:
Good morning.

A few questions about the runtime -

1. Is there any way to predict how a program's tasks will be
     executed on a given piece of hardware?  We've seen the
     performance of programs change depending on variations in
     the implementation and don't understand why it happens.  A
     few examples.  One program has an outer loop and calls a
     couple subroutines that use forall's, others that use for.
     If we use a for in the outer loop then one core on the CPU
     is loaded 100% and the other 3 at 40%.  Switching to a
     forall or coforall, everything runs serially on one core.
     Another program using cobegin never ran on more than two
     cores.  This is using qthread + hwloc.  Using the quickstart
     compiler, programs would saturate all the cores.

     You can get information about when tasks launch; is there
     some way to understand why the runtime is making the choices
     it does for this behavior?  It would be better to make
     intelligent design decisions rather than randomly swapping
     for, forall, and coforall to see the effect - it's all a bit
     of a black box.

The most general answer is that it can be difficult and perhaps impossible to predict how tasks will be executed on a given piece of hardware. But there are some things you can do to improve the situation. First, just so everything is clear: for-stmts are fully serial, coforall-stmts are fully parallel (one Chapel task per iteration), and forall-stmts fall in between. In particular, the number of Chapel tasks created for a forall-stmt is determined (perhaps indirectly) by the iterator controlling the loop. By default it's the number of hardware cores (cores, not hyper-threads or the like) on the node; this can be adjusted when you run the program by using any of several configuration settings documented in doc/README.tasks (or doc/release/README.tasks if you're working with a copy of the repo instead of a release tarball). The basic idea is that we're trying to maximize use of the hardware and thus reduce wall time, while adding the least amount of overhead we can.
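
To make that concrete, here's a minimal sketch of the three loop forms (the iteration count and the loop body are just stand-ins I made up):

    proc work(i: int) {          // stand-in for real per-iteration work
      writeln("iteration ", i);
    }

    for i in 1..8 do             // serial: the current task runs all 8
      work(i);                   //   iterations, in order

    coforall i in 1..8 do        // fully parallel: 8 tasks, one per iteration
      work(i);

    forall i in 1..8 do          // parallelism chosen by the controlling
      work(i);                   //   iterator: by default, roughly one task
                                 //   per core, each given a chunk of the 8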

The qthreads and fifo tasking layer implementations work similarly to each other, with differences in the details. In both cases some number of worker threads scan work queues, looking for Chapel tasks to run. In the qthreads tasking layer, which is our current default, there are a number of task queues, how many depending on the number of cores. There is one worker thread tied to each core, typically with more than one worker sharing a queue. The workers are created when the program starts. Tasks are distributed into the queues, and the workers compete to run them. (Without going even further into the details than I already am, you can imagine that load balancing among the workers is done by a combination of round-robin task distribution amongst the queues and perhaps work stealing across them.) If a task suspends, for example because it accesses a sync variable that is not in the desired full/empty state, then the worker can put the task aside to be continued later and pick up another task that is runnable. Thus task switching is done in user space (no kernel intervention) and is lightweight. An observable side effect of the qthreads tasking layer behavior is that serial programs may run longer than with the fifo tasking layer described next, because qthreads starts worker threads that end up not being used.
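
To illustrate the kind of suspension I mean, here's a minimal sketch using a sync variable (the variable names and work are mine):

    var done$: sync bool;          // a sync variable; starts logically empty
    begin {                        // spawn one new task
      // ... work done by the spawned task ...
      done$.writeEF(true);         // writing leaves it full
    }
    const ok = done$.readFE();     // if still empty, this task suspends here;
                                   //   the worker parks it and runs other
                                   //   runnable tasks in the meantime
    writeln("spawned task done: ", ok);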

In the fifo tasking layer there is one queue; worker threads are created on the fly, without bound, as the runtime sees that there are more tasks in the queue than the existing threads could run; tasks occupy their host thread throughout their lifespan; and threads that have completed their tasks hang around looking for more work. A task in the queue will start quite rapidly if it is picked up by an existing thread looking for work, but if it has to wait for a thread to be created it will start more slowly, since thread creation is expensive (a kernel syscall). If a task suspends, again for example because it accesses a sync variable that is not in the desired full/empty state, then the worker retains the task but yields the core. Thus task switching is effectively done in the kernel and is heavier weight than in qthreads tasking. There are a couple of observable side effects of the overall fifo tasking layer behavior. One is that the first parallel construct in a program may run much more slowly than succeeding ones, due to ramping up worker threads. Another is that even if a given parallel construct creates more tasks than there are existing threads, you may not get full task parallelism if the task bodies are short compared to how long it takes to make a thread: even though the runtime creates more threads because the task queue has grown, the already-existing threads can mow through the tasks serially before the new threads arrive to help.

There are several caveats with all of the above. One is that your ability to deduce useful information about the program's tasking behavior by watching any kind of system load monitor will be limited. The reason is that worker threads which don't have Chapel tasks to run will load their respective CPUs pretty heavily as they pound on the queues looking for work. Nothing in the system monitor output distinguishes cores hosting worker threads that are running Chapel tasks from cores hosting worker threads that are searching (perhaps in vain) for Chapel tasks to run.

The best way to get predictable behavior out of any of the tasking layers is to give them plenty of work, but let them choose how to place that work on the hardware. Don't second-guess them. In practice this means:

 * Big loops (more iterations and/or larger loop bodies) are better
   than small ones. No tasking layer (in any programming model) will
   work well on a parallel loop whose effective per-task work isn't
   significantly greater than the overhead of managing the parallelism.
 * Don't use a coforall-stmt for a loop with many more iterations than
   you expect to have CPUs. The loop cannot effectively be more
   parallel than the number of CPUs anyway, so forcing the creation of
   more tasks than this will generally just increase overhead (CPU
   time) without reducing wall time. Generally forall-stmts are a
   better way to express loop parallelism (see the sketch just after
   this list).
 * Don't adjust the parameters described in README.tasks unless you
   know (from prior experimentation, say) that doing so yields better
   performance for the particular program you're running. Or, of
   course, because you're experimenting precisely to see the effect of
   such changes.
 * Don't run oversubscribed with CHPL_COMM!=none, that is, don't
   execute multiple Chapel locales on a single compute node. The
   tasking layers assume that all the hardware they see belongs to
   them, and if there is significant competition from other things,
   either inside the program or outside it, they can make bad
   decisions from a performance point of view.

I'd have to have more information about your test case(s) to say anything specific about why you were seeing the CPU loading you describe with qthreads tasking. We could tackle that separately from this more general email.


2. Does a 'for param' unroll within a cobegin?  One program
     only ran serially with this setup, while a coforall over the
     parameter ran in parallel.  But maybe this is related to the
     first question.
     The code skeleton looked like
        cobegin {
          for param bank in 1..nbank do run_filter(bank);
          /* process result */
        }
     vs.
        coforall bank in 1..nbank {
          run_filter(bank);
        /* process result */
        }
     where the number of banks was small, about 20, but too many
     to write out.

The 'for param' loop is unrolled at compile time, but it doesn't unroll into separate cobegin statements, and unrolling doesn't create any parallelism. A cobegin-stmt creates one task per statement in its block, and the whole for-loop (unrolled or not) is a single statement, so its iterations all execute serially within that one task.
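
To make the task structure explicit, here's a sketch of both versions (run_filter and nbank come from your skeleton; the stub bodies and processResult are made-up stand-ins):

    config param nbank = 20;
    proc run_filter(bank: int) { }    // stand-in for your filter routine
    proc processResult() { }          // stand-in for the result processing

    cobegin {
      for param bank in 1..nbank do   // statement 1 => one task; the unrolled
        run_filter(bank);             //   iterations run serially inside it
      processResult();                // statement 2 => a second task
    }                                 // so this cobegin creates just 2 tasks

    coforall bank in 1..nbank {       // one task per value of bank => 20 tasks
      run_filter(bank);
      processResult();                // per-bank processing, as in your 2nd form
    }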


3. Some errors are trapped by the runtime, others are not and
     just exit with a short message to the console.  Examples
     include segfaults and floating point exceptions.  Is it
     possible to print the line number where the error occurred
     (as staring at the code waiting for enlightenment isn't the
     fastest way to finding the problem)?  Or, what way do you
     recommend to debug problems like this?

There's an admittedly imperfect distinction in the internal code between checks for erroneous situations that are in some sense predictable and those that aren't. For the unpredictable ones (no runtime message) you can always compile with -g and set your core file limit appropriately to allow dumping core for, say, a segfault. However, looking at core files corresponding to Chapel programs is something of a black art, because the Chapel compiler produces C code and re-compiles that with a C compiler to produce an executable.

I confess I'm a little surprised you're getting segfaults unless you're compiling with --fast or something else that turns off checks. The obvious things that would cause segfaults, such as array mis-indexing, should be caught by the checks. Perhaps the most common remaining cause of a segfault is a stack overflow, since these are "detected" by inaccessible guard pages which do precisely that: cause a segfault when the stack is overflowed. Are the segfaulting programs written in such a way that they might require large task stacks? For example, do they have large arrays local to Chapel procs? (I may need some help here from other Chapel folks, since I'm primarily a runtime person and the decision as to which Chapel variables are placed on the stack and which in the heap is a bit of a black box to me.)
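
For the core-file route mentioned above, a minimal sketch of the steps (the program name is made up, and the core file's name and location vary by system):

    chpl -g myprog.chpl -o myprog   # -g keeps debugging symbols
    ulimit -c unlimited             # allow core files in this shell
    ./myprog                        # a segfault should now leave a core file
    gdb ./myprog core               # the frames will be in the generated C

Compiling with --savec <dir> also keeps the generated C sources around, which can make those frames a little easier to relate back to your Chapel source.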

Floating point halts you'll probably have to chase by means of core dumps, unfortunately.

A potential alternative to core dumps would be to set CHPL_COMM=gasnet, rebuild Chapel and your test case(s), and then run on just a single locale with '-nl 1'. The GASNet package that underlies our gasnet communication layer will catch most signals and produce a traceback of all program threads. This turns off core file production but can be a quicker way to get a traceback. To reduce the amount of output you have to wade through, it might help to limit the number of tasking runtime threads by setting the CHPL_RT_NUM_THREADS_PER_LOCALE environment variable (see README.tasks again). Don't set it below 2, though, or you can cause deadlock in your Chapel program.
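
Roughly, the steps would look like this (assuming a from-source Chapel in $CHPL_HOME; the program name is made up):

    export CHPL_COMM=gasnet
    cd $CHPL_HOME && make                      # rebuild the runtime for gasnet
    chpl myprog.chpl -o myprog                 # rebuild the test case
    CHPL_RT_NUM_THREADS_PER_LOCALE=2 ./myprog -nl 1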


4. We have one program that runs for a small number of iterations
     but dies with a slightly larger (300 trials instead of 100).
     The load on the CPU seems normal, but it will just stop with a
     succinct message on the console: "Killed".  How do we find out
     what's causing the problem?

"Killed" is printed by the system when a process is killed by a SIGKILL signal. The most common reason for this is the system running out of memory, including swap space. When this happens a thingie (for lack of a better word) in the kernel called the "OOM killer" (OOM == Out Of Memory, google "OOM killer" for lots of info) makes a best guess as to the offending process and kills it. It's possible this is related to your question #3. If the product of your task count and per-task memory requirements were big enough I could see this happening. What does the loop structure look like (for/forall/coforall, nesting, etc.) and what are the per-iteration memory requirements?

Hope this helps!

greg


Thanks again for the help,

Greg
