Hi Greg --
I'm Greg Titus and I work on the Chapel runtime, with an emphasis on
tasking, low level memory management, and inter-locale communication.
Some of what I've written here will likely already be familiar to you,
but I've tried to err on the side of giving more info in the hope that
will be useful to others.
On 5/27/2015 6:53 AM, Greg Kreider wrote:
Good morning.
A few questions about the runtime -
1. Is there any way to predict how a program's tasks will be
executed on a given piece of hardware? We've seen the
performance of programs change depending on variations in
the implementation and don't understand why it happens. A
few examples. One program has an outer loop and calls a
couple subroutines that use forall's, others that use for.
If we use a for in the outer loop then one core on the CPU
is loaded 100% and the other 3 at 40%. Switching to a
forall or coforall, everything runs serially on one core.
Another program using cobegin never ran on more than two
cores. This is using qthread + hwloc. Using the quickstart
compiler programs would saturate all the cores.
You can get information about when tasks launch; is there
some way to understand why the runtime is making the choices
it does for this behavior? It would be better to make
intelligent design decisions rather than randomly swapping
for, forall, and coforall to see the effect - it's all a bit
of a black box.
The most general answer is that it can be difficult and perhaps
impossible to predict how tasks will be executed on a given piece of
hardware. But there are some things you can do to improve the situation.
First, just so everything is clear, for-stmts are fully serial,
coforall-stmts are fully parallel (one Chapel task per iteration), and
forall-stmts lie in between. In particular, the number of Chapel tasks
created for a forall-stmt is determined (perhaps indirectly) by the
iterator controlling the loop. By default it's the number of hardware
cores (cores, not hyper-threads or the like); this can
be adjusted when you run the program by using any of several config
params documented in doc/README.tasks (or doc/release/README.tasks if
you're working with a copy of the repo instead of a release tarball).
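As a quick illustration of the three forms (work() here is just a
hypothetical stand-in for a loop body):

proc work(i: int) { /* ...per-iteration computation... */ }
config const n = 100000;

// for: fully serial; a single task runs every iteration
for i in 1..n do work(i);

// forall: the range's default parallel iterator chooses the task
// count -- roughly one task per core unless you adjust it
forall i in 1..n do work(i);

// coforall: exactly one task per iteration -- 100000 tasks here,
// far more parallelism than the hardware can use
coforall i in 1..n do work(i);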
The basic idea is that we're trying to maximize use of the hardware and
thus reduce wall time, while adding the least amount of overhead we can.
The qthreads and fifo tasking layer implementations work similarly to
each other, with differences in the details. In both cases some number
of worker threads scan work queues, looking for Chapel tasks to run. In
the qthreads tasking layer, which is our current default, the number of
task queues depends on the number of cores. One worker thread is tied
to each core, and typically more than one worker shares a queue. The
workers are created when the program starts. Tasks
are distributed into the queues, and the workers compete to run them.
(Without going even further into the details than I already am, you can
imagine that load balancing among the workers is done by a combination
of round-robin task distribution amongst the queues and perhaps work
stealing across them.) If a task suspends, for example because it
accesses a sync variable that is not in the desired full/empty state,
then the worker can put the task aside to be continued later, and pick
up another task that is runnable. Thus task switching is done in user
space (no kernel intervention) and is lightweight. An observable side
effect of the qthreads tasking layer behavior is that serial programs
may run longer than with the fifo tasking layer described next, due to
qthreads starting worker threads that will end up not being used.
In the fifo tasking layer there is one queue, worker threads are created
on the fly without bound as the runtime sees that there are more tasks
in the queue that could be run, tasks occupy their host thread
throughout their lifespan, and threads that have completed their tasks
hang around looking for more work. A task in the queue will start quite
rapidly if it is picked up by a thread looking for work, but if it has
to wait for a thread to be created it will start more slowly since
thread creation is expensive (a kernel syscall). If a task suspends,
again for example because it accesses a sync variable that is not in the
desired full/empty state, then the worker retains the task but yields
the core. Thus task switching is effectively done in the kernel and is
heavier weight than in qthreads tasking. There are a couple of
observable side effects of the overall fifo tasking layer behavior. One
is that the first parallel construct in a program may run much more
slowly than succeeding ones due to ramping up worker threads. Another is
that even when a parallel construct creates more tasks than there are
existing threads, you may not get full task parallelism if the task
bodies are short compared to the time it takes to make a thread: the
runtime will create more threads as the task queue grows, but the
already-existing threads can mow through the tasks serially before the
new threads arrive to help.
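To make the suspension scenario concrete, here's a minimal sketch (just
an illustration, not anything from your code) of one task blocking on a
sync variable until another task fills it:

var flag: sync int;            // a sync variable starts out "empty"

cobegin {
  {
    // consumer: the read blocks until flag is full, so this task
    // suspends and its worker can deal with other runnable work
    const v = flag;            // read when full, leave empty
    writeln("got ", v);
  }
  {
    // producer: fills flag, letting the consumer resume
    flag = 42;                 // write when empty, leave full
  }
}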
There are several caveats with all of the above. One is that your
ability to deduce useful information about the program's tasking
behavior by watching any kind of system load monitor will be limited.
The reason is that worker threads which don't have Chapel tasks to run
will still load their respective CPUs pretty heavily as they pound on
the queues looking for work. Nothing in the system monitor output
distinguishes cores whose worker threads are running Chapel tasks from
cores whose worker threads are merely searching (perhaps in vain) for
Chapel tasks to run.
The best way to get predictable behavior out of any of the tasking layers
is to give them plenty of work, but let them choose how to place that
work on the hardware. Don't second-guess them. In practice this means:
* Big loops (more iterations and/or larger loop bodies) are better
than small ones. No tasking layer (in any programming model) will
work well on a parallel loop whose effective per-task work isn't
significantly greater than the overhead of managing the parallelism.
* Don't use a coforall-stmt for a loop with many more iterations than
you expect to have CPUs. The loop cannot effectively be more
parallel than the number of CPUs anyway, so forcing the creation of
more tasks than this will generally just increase overhead (CPU
time) without reducing wall time. Forall-stmts are usually a better
way to express loop parallelism (see the sketch after this list).
* Don't adjust the parameters described in README.tasks unless you
know (from prior experimentation, say) that doing so yields better
performance for the particular program you're running. Or, of
course, because you're experimenting precisely to see the effect of
such changes.
* Don't run oversubscribed with CHPL_COMM!=none, that is, executing
multiple Chapel locales but on a single compute node. The tasking
layers assume that the hardware they see all belongs to them and if
there is significant competition from other things either inside the
program or outside, they can make bad decisions from a performance
point of view.
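Here's the sketch promised in the second bullet, with process() as a
hypothetical stand-in for the per-iteration work:

config const n = 100000;

proc process(i: int) { /* ...per-iteration work... */ }

// preferred: the range's parallel iterator picks the task count
// (roughly one task per core by default)
forall i in 1..n do process(i);

// avoid: this creates n tasks even though the loop can't be more
// parallel than the number of cores, so it mostly adds overhead
coforall i in 1..n do process(i);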
I'd have to have more information about your test case(s) to say
anything specific about why you saw the CPU loading you did with
qthreads tasking. We could tackle that separately from this more
general email.
2. Does a 'for param' unroll within a cobegin? One program
only ran serially with this setup, while a coforall over the
parameter ran in parallel. But maybe this is related to the
first question.
The code skeleton looked like
cobegin {
  for param bank in 1..nbank do run_filter(bank);
  /* process result */
}
vs.
coforall bank in 1..nbank {
  run_filter(bank);
  /* process result */
}
where the number of banks was small, about 20, but too many
to write out.
A 'for param' loop is unrolled at compile time, but unrolling doesn't
introduce any parallelism: the iterations still execute serially (as a
for-stmt's iterations always do) within the single task the cobegin-stmt
creates for that statement.
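The underlying rule is that a cobegin-stmt creates one task per
statement in its body, and the entire for loop counts as a single
statement. A minimal sketch of the difference, with run_filter()
standing in for your real routine:

config param nbank = 20;

proc run_filter(bank: int) { /* ...filter one bank... */ }

// cobegin: one task per *statement* in its body -- two tasks here, and
// all nbank run_filter() calls execute serially within the first one,
// whether or not the param loop is unrolled
cobegin {
  for param bank in 1..nbank do run_filter(bank);
  writeln("process result");
}

// coforall: one task per *iteration*, so the nbank calls can run in
// parallel (up to the number of cores, in practice)
coforall bank in 1..nbank do run_filter(bank);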
3. Some errors are trapped by the runtime, others are not and
just exit with a short message to the console. Examples
include segfaults and floating point exceptions. Is it
possible to print the line number where the error occurred
(as staring at the code waiting for enlightenment isn't the
fastest way to finding the problem)? Or, what way do you
recommend to debug problems like this?
There's an admittedly imperfect distinction in the internal code between
checks for erroneous situations that are in some sense predictable and
situations that aren't. For the unpredictable
ones (no runtime message) you can always compile with -g and set your
core file limit appropriately to allow dumping core for, say, a
segfault. However, looking at core files corresponding to Chapel
programs is something of a black art due to the fact that the Chapel
compiler produces C code and re-compiles that with a C compiler to
produce an executable. I confess I'm a little surprised you're getting
segfaults unless you're compiling with --fast or something else that
turns off checks. The obvious things that would cause segfaults, such as
array mis-indexing, should be caught by the checks. Perhaps the most
common reason for a segfault would be a stack overflow, since these are
"detected" by use of inaccessible guard pages which do precisely that:
cause a segfault when the stack is overflowed. Are the segfaulting
programs written in such a way that they might require large task
stacks? For example, do they have large arrays local to Chapel procs? (I
may need some help here from other Chapel folks since I'm primarily a
runtime person and the decision as to which Chapel variables are placed
on the stack and which are placed in the heap is a bit of a black box to
me.)
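Purely as an illustration of the kind of pattern I have in mind (the
names and sizes are invented), something like this could overflow a
default-sized task stack if the local array ends up on the stack rather
than the heap:

proc filter_one_bank() {
  var scratch: [1..2000000] real;   // ~16 MB of local working space
  // ...fill and use scratch...
}

coforall bank in 1..20 do
  filter_one_bank();

If that turns out to be the issue, the task stack size can be raised
with the CHPL_RT_CALL_STACK_SIZE environment variable (also described
in README.tasks, if memory serves).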
Floating point halts you'll probably have to chase by means of core
dumps, unfortunately.
A potential alternative to core dumps would be to build with
CHPL_COMM=gasnet, rebuild Chapel and your test case(s), and then run on
just a single locale with '-nl 1'. The GASNet package that underlies our
gasnet communication layer will catch most signals and produce a
traceback of all program threads. This turns off core file production
but can be a quicker way to get a traceback. To reduce the amount of
output you have to wade through it might be helpful to limit the number
of tasking runtime threads, by setting the
CHPL_RT_NUM_THREADS_PER_LOCALE environment variable (see README.tasks
again). Don't set it below 2, though, or you can cause deadlock in your
Chapel program.
4. We have one program that runs for a small number of iterations
but dies with a slightly larger (300 trials instead of 100).
The load on the CPU seems normal, but it will just stop with a
succinct message on the console: "Killed". How do we find out
what's causing the problem?
"Killed" is printed by the system when a process is killed by a SIGKILL
signal. The most common reason for this is the system running out of
memory, including swap space. When this happens a thingie (for lack of a
better word) in the kernel called the "OOM killer" (OOM == Out Of
Memory, google "OOM killer" for lots of info) makes a best guess as to
the offending process and kills it. It's possible this is related to
your question #3. If the product of your task count and per-task memory
requirements were big enough I could see this happening. What does the
loop structure look like (for/forall/coforall, nesting, etc.) and what
are the per-iteration memory requirements?
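Purely as a hypothetical illustration of the arithmetic (I'm inventing
the structure and sizes), a pattern like this scales its memory use
with the trial count when many trial tasks are alive at once:

config const numTrials = 300;

coforall t in 1..numTrials {
  var working: [1..10000000] real;   // ~80 MB of per-trial state
  // ...run one trial...
}

If most of those tasks are active at the same time, that's on the order
of 300 * 80 MB, roughly 24 GB, which could easily draw the OOM killer's
attention; a forall (roughly one task per core) or a smaller per-trial
footprint keeps the total bounded.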
Hope this helps!
greg
Thanks again for the help,
Greg
------------------------------------------------------------------------------
_______________________________________________
Chapel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-users