Hi Greg,

Sorry for not getting a reply off sooner - I've been working
on the bug for #4, where the program would crash with a "Killed"
message.  I managed to convert it to a "halt reached - array
reference count is negative!" error, and I think it's a problem
in Chapel itself - I'll submit it to the bug list.

> I'd have to have more information about your test case(s) to say
> anything specifically about why you were seeing the CPU loading you were
> with qthreads tasking. We could tackle that separately from this more
> general email.

Thanks for the detailed description; it helps (and the closing
advice - trust the system - wasn't unexpected!).  In a week or two
I should have the program that triggered this behavior released,
along with a test case.  Maybe we can look at it then?

>> 3. Some errors are trapped by the runtime, others are not and
>>      just exit with a short message to the console.  Examples
>>      include segfaults and floating point exceptions.  Is it
>>      possible to print the line number where the error occurred
>>      (as staring at the code waiting for enlightenment isn't the
>>      fastest way to find the problem)?  Or, how do you
>>      recommend debugging problems like this?
>
> There's an admittedly imperfect distinction in internal code between
> internal checks for erroneous situations that are in some sense
> predictable and those that aren't predictable. For the unpredictable
> ones (no runtime message) you can always compile with -g and set your
> core file limit appropriately to allow dumping core for, say, a
> segfault. However, looking at core files corresponding to Chapel
> programs is something of a black art due to the fact that the Chapel
> compiler produces C code and re-compiles that with a C compiler to
> produce an executable. I confess I'm a little surprised you're getting
> segfaults unless you're compiling with --fast or something else that
> turns off checks. The obvious things that would cause segfaults, such as
> array mis-indexing, should be caught by the checks. Perhaps the most
> common reason for a segfault would be a stack overflow, since these are
> "detected" by use of inaccessible guard pages which do precisely that:
> cause a segfault when the stack is overflowed. Are the segfaulting
> programs written in such a way that they might require large task
> stacks? For example, do they have large arrays local to Chapel procs? (I
> may need some help here from other Chapel folks since I'm primarily a
> runtime person and the decision as to which Chapel variables are placed
> on the stack and which are placed in the heap is a bit of a black box to
> me.)
>

>
>> 4. We have one program that runs for a small number of iterations
>>      but dies with a slightly larger number (300 trials instead of
>>      100).  The load on the CPU seems normal, but it will just stop
>>      with a succinct message on the console: "Killed".  How do we
>>      find out what's causing the problem?
>
> "Killed" is printed by the system when a process is killed by a SIGKILL
> signal. The most common reason for this is the system running out of
> memory, including swap space. When this happens a thingie (for lack of a
> better word) in the kernel called the "OOM killer" (OOM == Out Of
> Memory, google "OOM killer" for lots of info) makes a best guess as to
> the offending process and kills it. It's possible this is related to
> your question #3. If the product of your task count and per-task memory
> requirements were big enough I could see this happening. What does the
> loop structure look like (for/forall/coforall, nesting, etc.) and what
> are the per-iteration memory requirements?
>

How big are the task stack limits?  The arrays we work with (for
image processing) are on the order of 10-20 MB.  We did trace one
problem to an array stored in a record, and suspect that the way
records are passed to procedures was the culprit.
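
To make that concrete, the suspect pattern looked roughly like the
sketch below.  (This is made up to show the shape of the code - the
names and sizes aren't from the real program.)

    record Image {
      var d: domain(2) = {1..1500, 1..1500};  // ~2.25M elements
      var pix: [d] real;                      // ~18 MB of pixel data
    }

    proc process(img: Image) {
      // Does the default intent copy the whole record, array and all?
      // ... read img.pix here ...
    }

    var im: Image;
    process(im);    // fine for small domains, died for large ones

If that copy (or some temporary) lands on a task stack, it would
square with what we saw.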

It's not an OOM problem, though.  The code was running serially
(plain for loops, no parallel constructs), and the arrays were on
the order of a thousand elements each - say 10 or 20 arrays in
total - so memory use shouldn't have been anywhere near any system
limit.  Are there other reasons a process would get a SIGKILL?
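
(Next time it dies I'll also check the kernel log, which - if I
understand the OOM killer correctly - is where it reports its
victims:

    $ dmesg | grep -i -e oom -e 'killed process'

If nothing shows up there, the SIGKILL presumably came from
somewhere else.)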


None of these problems were unsolvable, by the way.  We could always
change the code to get around them, and I'm not sure whether we were
trying to do something questionable in the language or hitting a
problem in Chapel's implementation (given that it's still somewhat
a work in progress).  The main annoyance is that the messages are
unhelpful, so finding the fix takes guesswork and is fairly tedious.
When a runtime check fails, you get an error message pointing to
where in the code to look, and you can figure out what's going
wrong.  But when there's a heap of generated C code between the
executable and the Chapel source, that's not really possible.

Might it be possible to generate a map of the C lines that
correspond to each Chapel source line, so that you could use the
normal C debugging tools to find where the error occurred and then
trace it back to the Chapel source?  Bonus points if there were a
way to do that automatically ...
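
In the meantime, am I right that the closest workaround is to keep
the generated C around and debug that directly?  Something like
this, assuming I have the flags right:

    $ chpl -g --savec gen prog.chpl   # keep the generated C in ./gen
    $ ulimit -c unlimited
    $ ./prog                          # ... segfaults, dumps core ...
    $ gdb ./prog core                 # backtrace points into gen/

and then eyeball the C around the faulting line to guess at the
Chapel statement it came from.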


Ach, these are hard problems to debug!  And it's worse when there
are only vague descriptions of transient failures to go on.  It
would be nice if there were some way to make it clearer what was
going wrong.

Greg
