I have a Julia application that uses MPI to communicate between several 
processes. Each process uses many tasks, and they send functions to remote 
locations to be executed.

If I use a large number of tasks per process, I get segfaults. When I am 
able to obtain a stack backtrace, the segfault is usually in array.c or 
gc.c, in routines related to memory allocation, often while growing a 
buffer for serialization. I've added a few assert statements there and 
examined the code, and it seems that these routines themselves are not to 
blame. My working assumption is therefore that some other code is 
overwriting memory and corrupting the internal data structures of libc's 
malloc.

- Do you have pointers for debugging this in Julia?
- Is there a "memory-debug" mode for Julia, for its garbage collector, for 
flisp, for flisp's garbage collector, ...?
- Is there a way to rebuild Julia with more aggressive self-checking 
enabled?

I can reproduce the error quite reliably, but it occurs at a different 
place each time. Unfortunately, the error goes away if I reduce the number 
of tasks or the number of processes (a Heisenbug: 
<https://en.wikipedia.org/wiki/Heisenbug>).

-erik
