Hi Greg --
I'm Greg Titus and I work on the Chapel runtime, with an emphasis on
tasking, low level memory management, and inter-locale communication.
Some of what I've written here will likely already be familiar to you,
but I've tried to err on the side of giving more info in the hope that
will be useful to others.
On 5/27/2015 6:53 AM, Greg Kreider wrote:
Good morning.
A few questions about the runtime -
1. Is there any way to predict how a program's tasks will be
executed on a given piece of hardware? We've seen the
performance of programs change depending on variations in
the implementation and don't understand why it happens. A
few examples. One program has an outer loop and calls a
couple subroutines that use forall's, others that use for.
If we use a for in the outer loop then one core on the CPU
is loaded 100% and the other 3 at 40%. Switching to a
forall or coforall, everything runs serially on one core.
Another program using cobegin never ran on more than two
cores. This is using qthread + hwloc. Using the quickstart
compiler programs would saturate all the cores.
You can get information about when tasks launch; is there
some way to understand why the runtime is making the choices
it does for this behavior? It would be better to make
intelligent design decisions rather than randomly swapping
for, forall, and coforall to see the effect - it's all a bit
of a black box.
The most general answer is that it can be difficult and perhaps
impossible to predict how tasks will be executed on a given piece of
hardware. But there are some things you can do to improve the situation.
First, just so everything is clear, for-stmts are fully serial,
coforall-stmts are fully parallel (one Chapel task per iteration), and
forall-stmts lie in between. In particular, the number of Chapel tasks
created for a forall-stmt is determined (perhaps indirectly) by the
iterator controlling the loop. By default it's the number of hardware
cores (cores, not hyper-threads or the like); this can
be adjusted when you run the program by using any of several config
params documented in doc/README.tasks (or doc/release/README.tasks if
you're working with a copy of the repo instead of a release tarball).
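As a quick illustration of the three forms (work() here is just a
hypothetical stand-in for a loop body):

proc work(i: int) { /* ...per-iteration computation... */ }
config const n = 100000;

// for: fully serial; a single task runs every iteration
for i in 1..n do work(i);

// forall: the range's default parallel iterator chooses the task
// count -- roughly one task per core unless you adjust it
forall i in 1..n do work(i);

// coforall: exactly one task per iteration -- 100000 tasks here,
// far more parallelism than the hardware can use
coforall i in 1..n do work(i);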
The basic idea is that we're trying to maximize use of the hardware and
thus reduce wall time, while adding the least amount of overhead we can.
The qthreads and fifo tasking layer implementations work similarly to
each other, with differences in the details. In both cases some number
of worker threads scan work queues, looking for Chapel tasks to run. In
the qthreads tasking layer, which is our current default, the number of
task queues depends on the number of cores. One worker thread is tied
to each core, and typically more than one worker shares a queue. The
workers are created when the program starts. Tasks
are distributed into the queues, and the workers compete to run them.
(Without going even further into the details than I already am, you can
imagine that load balancing among the workers is done by a combination
of round-robin task distribution amongst the queues and perhaps work
stealing across them.) If a task suspends, for example because it
accesses a sync variable that is not in the desired full/empty state,
then the worker can put the task aside to be continued later, and pick
up another task that is runnable. Thus task switching is done in user
space (no kernel intervention) and is lightweight. An observable side
effect of the qthreads tasking layer behavior is that serial programs
may run longer than with the fifo tasking layer described next, due to
qthreads starting worker threads that will end up not being used.
In the fifo tasking layer there is one queue, worker threads are created
on the fly without bound as the runtime sees that there are more tasks
in the queue that could be run, tasks occupy their host thread
throughout their lifespan, and threads that have completed their tasks
hang around looking for more work. A task in the queue will start quite
rapidly if it is picked up by a thread looking for work, but if it has
to wait for a thread to be created it will start more slowly since
thread creation is expensive (a kernel syscall). If a task suspends,
again for example because it accesses a sync variable that is not in the
desired full/empty state, then the worker retains the task but yields
the core. Thus task switching is effectively done in the kernel and is
heavier weight than in qthreads tasking. There are a couple of
observable side effects of the overall fifo tasking layer behavior. One
is that the first parallel construct in a program may run much more
slowly than succeeding ones due to ramping up worker threads. Another is
that even when a parallel construct creates more tasks than there are
existing threads, you may not get full task parallelism if the task
bodies are short compared to the time it takes to make a thread: the
runtime will create more threads as the task queue grows, but the
already-existing threads can mow through the tasks serially before the
new threads arrive to help.
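To make the suspension scenario concrete, here's a minimal sketch (just
an illustration, not anything from your code) of one task blocking on a
sync variable until another task fills it:

var flag: sync int;            // a sync variable starts out "empty"

cobegin {
  {
    // consumer: the read blocks until flag is full, so this task
    // suspends and its worker can deal with other runnable work
    const v = flag;            // read when full, leave empty
    writeln("got ", v);
  }
  {
    // producer: fills flag, letting the consumer resume
    flag = 42;                 // write when empty, leave full
  }
}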
There are several caveats with all of the above. One is that your
ability to deduce useful information about the program's tasking
behavior by watching any kind of system load monitor will be limited.
The reason is that worker threads which don't have Chapel tasks to run
will still load their respective CPUs pretty heavily as they pound on
the queues looking for work. Nothing in the system monitor output
distinguishes cores whose worker threads are running Chapel tasks from
cores whose worker threads are merely searching (perhaps in vain) for
Chapel tasks to run.
The best way to get predictable behavior out of any of the tasking layers
is to give them plenty of work, but let them choose how to place that
work on the hardware. Don't second-guess them. In practice this means:
* Big loops (more iterations and/or larger loop bodies) are better
than small ones. No tasking layer (in any programming model) will
work well on a parallel loop whose effective per-task work isn't
significantly greater than the overhead of managing the parallelism.
* Don't use a coforall-stmt for a loop with many more iterations than
you expect to have CPUs. The loop cannot effectively be more
parallel than the number of CPUs anyway, so forcing the creation of
more tasks than this will generally just increase overhead (CPU
time) without reducing wall time. Forall-stmts are usually a better
way to express loop parallelism (see the sketch after this list).
* Don't adjust the parameters described in README.tasks unless you
know (from prior experimentation, say) that doing so yields better
performance for the particular program you're running. Or, of
course, because you're experimenting precisely to see the effect of
such changes.
* Don't run oversubscribed with CHPL_COMM!=none, that is, executing
multiple Chapel locales but on a single compute node. The tasking
layers assume that the hardware they see all belongs to them and if
there is significant competition from other things either inside the
program or outside, they can make bad decisions from a performance
point of view.
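Here's the sketch promised in the second bullet, with process() as a
hypothetical stand-in for the per-iteration work:

config const n = 100000;

proc process(i: int) { /* ...per-iteration work... */ }

// preferred: the range's parallel iterator picks the task count
// (roughly one task per core by default)
forall i in 1..n do process(i);

// avoid: this creates n tasks even though the loop can't be more
// parallel than the number of cores, so it mostly adds overhead
coforall i in 1..n do process(i);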
I'd have to have more information about your test case(s) to say
anything specific about why you saw the CPU loading you did with
qthreads tasking. We could tackle that separately from this more
general email.
2. Does a 'for param' unroll within a cobegin? One program
only ran serially with this setup, while a coforall over the
parameter ran in parallel. But maybe this is related to the
first question.
The code skeleton looked like
cobegin {
  for param bank in 1..nbank do run_filter(bank);
  /* process result */
}
vs.
coforall bank in 1..nbank {
  run_filter(bank);
  /* process result */
}
where the number of banks was small, about 20, but too many
to write out.
A 'for param' loop is unrolled at compile time, but unrolling doesn't
introduce any parallelism: the iterations still execute serially (as a
for-stmt's iterations always do) within the single task the cobegin-stmt
creates for that statement.
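The underlying rule is that a cobegin-stmt creates one task per
statement in its body, and the entire for loop counts as a single
statement. A minimal sketch of the difference, with run_filter()
standing in for your real routine:

config param nbank = 20;

proc run_filter(bank: int) { /* ...filter one bank... */ }

// cobegin: one task per *statement* in its body -- two tasks here, and
// all nbank run_filter() calls execute serially within the first one,
// whether or not the param loop is unrolled
cobegin {
  for param bank in 1..nbank do run_filter(bank);
  writeln("process result");
}

// coforall: one task per *iteration*, so the nbank calls can run in
// parallel (up to the number of cores, in practice)
coforall bank in 1..nbank do run_filter(bank);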
3. Some errors are trapped by the runtime, others are not and
just exit with a short message to the console. Examples
include segfaults and floating point exceptions. Is it
possible to print the line number where the error occurred
(as staring at the code waiting for enlightenment isn't the
fastest way to finding the problem)? Or, what way do you
recommend to debug problems like this?
There's an admittedly imperfect distinction in the internal code between
checks for erroneous situations that are in some sense predictable and
situations that aren't. For the unpredictable
ones (no runtime message) you can always compile with -g and set your
core file limit appropriately to allow dumping core for, say, a
segfault. However, looking at core files corresponding to Chapel
programs is something of a black art due to the fact that the Chapel
compiler produces C code and re-compiles that with a C compiler to
produce an executable. I confess I'm a little surprised you're getting
segfaults unless you're compiling with --fast or something else that
turns off checks. The obvious things that would cause segfaults, such as
array mis-indexing, should be caught by the checks. Perhaps the most
common reason for a segfault would be a stack overflow, since these are
"detected" by use of inaccessible guard pages which do precisely that:
cause a segfault when the stack is overflowed. Are the segfaulting
programs written in such a way that they might require large task
stacks? For example, do they have large arrays local to Chapel procs? (I
may need some help here from other Chapel folks since I'm primarily a
runtime person and the decision as to which Chapel variables are placed
on the stack and which are placed in the heap is a bit of a black box to
me.)
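Purely as an illustration of the kind of pattern I have in mind (the
names and sizes are invented), something like this could overflow a
default-sized task stack if the local array ends up on the stack rather
than the heap:

proc filter_one_bank() {
  var scratch: [1..2000000] real;   // ~16 MB of local working space
  // ...fill and use scratch...
}

coforall bank in 1..20 do
  filter_one_bank();

If that turns out to be the issue, the task stack size can be raised
with the CHPL_RT_CALL_STACK_SIZE environment variable (also described
in README.tasks, if memory serves).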
Floating point halts you'll probably have to chase by means of core
dumps, unfortunately.
A potential alternative to core dumps would be to build with
CHPL_COMM=gasnet, rebuild Chapel and your test case(s), and then run on
just a single locale with '-nl 1'. The GASNet package that underlies our
gasnet communication layer will catch most signals and produce a
traceback of all program threads. This turns off core file production
but can be a quicker way to get a traceback. To reduce the amount of
output you have to wade through it might be helpful to limit the number
of tasking runtime threads, by setting the
CHPL_RT_NUM_THREADS_PER_LOCALE environment variable (see README.tasks
again). Don't set it below 2, though, or you can cause deadlock in your
Chapel program.
4. We have one program that runs for a small number of iterations
but dies with a slightly larger (300 trials instead of 100).
The load on the CPU seems normal, but it will just stop with a
succinct message on the console: "Killed". How do we find out
what's causing the problem?
"Killed" is printed by the system when a process is killed by a SIGKILL
signal. The most common reason for this is the system running out of
memory, including swap space. When this happens a thingie (for lack of a
better word) in the kernel called the "OOM killer" (OOM == Out Of
Memory, google "OOM killer" for lots of info) makes a best guess as to
the offending process and kills it. It's possible this is related to
your question #3. If the product of your task count and per-task memory
requirements were big enough I could see this happening. What does the
loop structure look like (for/forall/coforall, nesting, etc.) and what
are the per-iteration memory requirements?
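Purely as a hypothetical illustration of the arithmetic (I'm inventing
the structure and sizes), a pattern like this scales its memory use
with the trial count when many trial tasks are alive at once:

config const numTrials = 300;

coforall t in 1..numTrials {
  var working: [1..10000000] real;   // ~80 MB of per-trial state
  // ...run one trial...
}

If most of those tasks are active at the same time, that's on the order
of 300 * 80 MB, roughly 24 GB, which could easily draw the OOM killer's
attention; a forall (roughly one task per core) or a smaller per-trial
footprint keeps the total bounded.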
Hope this helps!
greg
Thanks again for the help,
Greg
------------------------------------------------------------------------------
_______________________________________________
Chapel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-users