Do you potentially have a memory or other resource leak? SIGBUS would be an odd result, but the symptom of crashing after running for a long time sometimes fits with a resource leak.
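One way to check for a leak is to sample the job's resident memory and open file descriptor count while it runs and see whether either grows steadily. The following is a minimal sketch, not anything from PETSc: it assumes a Linux node with /proc available and a process owned by the same user, and the helper names (parse_vmrss_kb, sample, monitor) are made up for illustration.

```python
import os
import sys
import time


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line looks like "VmRSS:     1234 kB"
            return int(line.split()[1])
    return None


def sample(pid):
    """Take one sample: (resident memory in kB, number of open descriptors)."""
    with open(f"/proc/{pid}/status") as f:
        rss_kb = parse_vmrss_kb(f.read())
    n_fds = len(os.listdir(f"/proc/{pid}/fd"))
    return rss_kb, n_fds


def monitor(pid, interval_s=60.0):
    """Print a timestamped sample every interval_s seconds until the process exits."""
    while True:
        try:
            rss_kb, n_fds = sample(pid)
        except (FileNotFoundError, ProcessLookupError):
            break  # the process has gone away
        print(f"{time.strftime('%H:%M:%S')} rss_kb={rss_kb} fds={n_fds}", flush=True)
        time.sleep(interval_s)


# Hypothetical usage: python monitor.py <pid of one MPI rank>
if __name__ == "__main__" and len(sys.argv) > 1:
    monitor(int(sys.argv[1]))
```

If either number climbs steadily over the hours before a crash, that would point toward a leak; flat numbers would make a leak less likely, though they would not rule out memory corruption.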

Mark Lohry <[email protected]> writes:

> I queued up some jobs with Barry's patch, so we'll see.
>
> Re Jed's suggestion at checkpointing, I don't *think* this is something
> coming from the state of the solution -- running from the same point I'm
> seeing it crash anywhere between 1 hour and 20 hours in. I'll increase my
> file save frequency in case I'm wrong there though.
>
> My intel build with different blas just made it through a 6 hour time slot
> without crash, whereas yesterday the same thing crashed after 3 hours. But
> given the randomness so far I'd bet that's just dumb luck.
>
> On Mon, Aug 24, 2020 at 4:22 PM Barry Smith <[email protected]> wrote:
>
>>
>> > On Aug 24, 2020, at 2:34 PM, Jed Brown <[email protected]> wrote:
>> >
>> > I'm thinking of something such as writing floating point data into the return address, which would be unaligned/garbage.
>>
>> Ok, my patch will detect this. This is what I was talking about, messing
>> up the BLAS arguments which are the addresses of arrays.
>>
>> Valgrind is by far the preferred approach.
>>
>> Barry
>>
>> Another feature we could add to the malloc checking is when a SEGV or
>> BUS error is encountered and we catch it we should run the
>> PetscMallocVerify() and check our memory for corruption reporting any we
>> find.
>>
>> >
>> > Reproducing under Valgrind would help a lot. Perhaps it's possible to checkpoint such that the breakage can be reproduced more quickly?
>> >
>> > Barry Smith <[email protected]> writes:
>> >
>> >> https://en.wikipedia.org/wiki/Bus_error
>> >>
>> >> But perhaps not true for Intel?
>> >>
>> >>> On Aug 24, 2020, at 1:06 PM, Matthew Knepley <[email protected]> wrote:
>> >>>
>> >>> On Mon, Aug 24, 2020 at 1:46 PM Barry Smith <[email protected]> wrote:
>> >>>
>> >>>> On Aug 24, 2020, at 12:39 PM, Jed Brown <[email protected]> wrote:
>> >>>>
>> >>>> Barry Smith <[email protected]> writes:
>> >>>>
>> >>>>>> On Aug 24, 2020, at 12:31 PM, Jed Brown <[email protected]> wrote:
>> >>>>>>
>> >>>>>> Barry Smith <[email protected]> writes:
>> >>>>>>
>> >>>>>>> So if a BLAS errors with SIGBUS then it is always an input error of just not proper double/complex alignment? Or some other very strange thing?
>> >>>>>>
>> >>>>>> I would suspect memory corruption.
>> >>>>>
>> >>>>> Corruption meaning what specifically?
>> >>>>>
>> >>>>> The routines crashing are dgemv which only take double precision arrays, regardless of what garbage is in those arrays i don't think there can be BUS errors resulting. They don't take integer arrays whose corruption could result in bad indexing and then BUS errors.
>> >>>>>
>> >>>>> So then it can only be corruption of the pointers passed in, correct?
>> >>>>
>> >>>> Such as those pointers pointing into data on the stack with incorrect sizes.
>> >>>
>> >>> But won't incorrect sizes "usually" lead to SEGV not SEGBUS?
>> >>>
>> >>> My understanding was that roughly memory errors in the heap are SEGV and memory errors on the stack are SIGBUS. Is that not true?
>> >>>
>> >>> Matt
>> >>>
>> >>> --
>> >>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>> >>> -- Norbert Wiener
>> >>>
>> >>> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
>> >>
