I'm getting seemingly random failures of late:

  Caught signal number 7 BUS: Bus Error, possibly illegal memory access

Symptoms:
1) Seems to only happen (so far) on larger cases, 400-2000 cores.
2) It doesn't happen right away. This was running happily for several hours, over several hundred time steps, with no indication of bad health in the numerics.
3) At least the total memory consumption seems to be within bounds, though I'm not sure about individual processes. E.g. Slurm here reported "Memory Efficiency: 75.23% of 1.76 TB (180.00 GB/node)".
4) Running the same setup twice, it fails at different points.

Any suggestions on what to look for? This is a bit painful to work on, as I can only reproduce it on large runs, and then it's seemingly random.

Thanks,
Mark
