I have a Julia application that uses MPI to communicate between several processes. Each process runs many tasks, and these tasks send functions to remote locations to be executed.
If I use a large number of tasks per process, I get segfaults. Sometimes I can obtain a stack backtrace; the segfaults usually occur in array.c or gc.c, in routines related to memory allocation, often while growing the buffer used for serialization. I've added a few assert statements there and examined the code, and these routines themselves do not seem to be to blame. My next assumption is therefore that something, somewhere, is overwriting memory and corrupting the internal data structures of libc's malloc.

- Do you have pointers for debugging this in Julia?
- Is there a "memory-debug" mode for Julia, for its garbage collector, for flisp, for flisp's garbage collector, ...?
- Is there a way to rebuild Julia with more aggressive self-checking enabled?

I can reproduce the error quite reliably, but it always occurs at a different place. Unfortunately, the error goes away if I reduce the number of tasks or the number of processes <https://en.wikipedia.org/wiki/Heisenbug>.

-erik
