Mark.
When valgrind is not feasible (as on many centrally controlled batch
systems), you can run PETSc with an extra flag that performs some memory
error checks:

   -malloc_debug

This
1) fills all malloc'ed memory with NaN, so that code using uninitialized
memory may be detected, and
2) checks the beginning and end of each allocated memory region for
out-of-bounds writes at each malloc and free.
It will slow the code down a little, but generally not by a huge amount.
It is nowhere near as good as valgrind or other memory corruption tools, but
it has the advantage that you can run it anywhere, on any size job.
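
To make the two checks concrete, here is a minimal standalone sketch of the
general technique (NaN fill plus guard bytes on either side of each
allocation). It is not PETSc's implementation: the names dbg_malloc,
dbg_check and dbg_free, the guard size, and the guard byte value are all
hypothetical, and this sketch only validates one block on request, whereas
the flag described above checks regions at each malloc and free.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parameters chosen for illustration only. */
enum { GUARD_BYTES = 8, GUARD_BYTE = 0xC5 };

/* Layout of one block: [uint64 size][guard][user data ...][guard] */
#define PREFIX_BYTES (sizeof(uint64_t) + GUARD_BYTES)

/* Allocate n bytes, fill them with NaN, and surround them with guards. */
void *dbg_malloc(size_t n)
{
  unsigned char *raw = malloc(PREFIX_BYTES + n + GUARD_BYTES);
  if (!raw) return NULL;

  uint64_t size = n;
  const double nan_value = NAN;
  unsigned char *user = raw + PREFIX_BYTES;

  memcpy(raw, &size, sizeof size);                    /* remember the size */
  memset(raw + sizeof size, GUARD_BYTE, GUARD_BYTES); /* lower guard */
  memset(user, 0xFF, n);          /* covers a trailing partial double slot */
  for (size_t i = 0; i + sizeof nan_value <= n; i += sizeof nan_value)
    memcpy(user + i, &nan_value, sizeof nan_value);   /* NaN fill */
  memset(user + n, GUARD_BYTE, GUARD_BYTES);          /* upper guard */
  return user;
}

/* Return 0 if both guards are intact, 1 if the lower guard was
   overwritten, 2 if the upper guard was overwritten. */
int dbg_check(const void *p)
{
  assert(p != NULL);
  const unsigned char *user = p;
  const unsigned char *lower = user - GUARD_BYTES;
  uint64_t size;

  /* Check the lower guard first: if it is damaged, the stored size
     in front of it may not be trustworthy either. */
  for (int i = 0; i < GUARD_BYTES; i++)
    if (lower[i] != GUARD_BYTE) return 1;

  memcpy(&size, user - PREFIX_BYTES, sizeof size);
  for (int i = 0; i < GUARD_BYTES; i++)
    if (user[size + i] != GUARD_BYTE) return 2;
  return 0;
}

/* Validate the guards, release the block, and report the check result. */
int dbg_free(void *p)
{
  if (!p) return 0;
  int status = dbg_check(p);
  free((unsigned char *)p - PREFIX_BYTES);
  return status;
}
```

With this sketch, reading an entry of a freshly allocated array of doubles
yields NaN, which tends to propagate visibly through the numerics, and a
write one element past the end of the array lands in the upper guard and is
reported by the next dbg_check or dbg_free.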
Barry
> On Aug 12, 2020, at 7:46 AM, Matthew Knepley <[email protected]> wrote:
>
> On Wed, Aug 12, 2020 at 7:53 AM Mark Lohry <[email protected]> wrote:
> I'm getting seemingly random failures of late:
> Caught signal number 7 BUS: Bus Error, possibly illegal memory access
>
> The first thing I would do is run valgrind on as wide an array of tests as
> you can. This will find problems
> on things that run completely fine.
>
> Thanks,
>
> Matt
>
> Symptoms:
> 1) Seems to only happen (so far) on larger cases, 400-2000 cores
> 2) It doesn't happen right away -- this was running happily for several hours
> over several hundred time steps with no indication of bad health in the
> numerics
> 3) At least the total memory consumption seems to be within bounds, though
> I'm not sure about individual processes. e.g. slurm here reported Memory
> Efficiency: 75.23% of 1.76 TB (180.00 GB/node)
> 4) running the same setup twice it fails at different points
>
> Any suggestions on what to look for? This is a bit painful to work on as I
> can only reproduce it on large runs and then it's seemingly random.
>
>
> Thanks,
> Mark
>
>
> --
> What most experimenters take for granted before they begin their experiments
> is infinitely more interesting than any results to which their experiments
> lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/