On Wed, Nov 11, 2020 at 12:43:50PM +0100, Maciej Zdeb wrote:
> Wow! Yes, I can confirm that a crash does not occur now. :) I checked 2.0
> and 2.2 branches. I'll keep testing it for a couple days just to be sure.
>
> So that stacktrace I shared before (on spoe_release_appctx function) was
> very lucky... Do you think that it'd be possible to find the bug without
> the replication procedure?

Very unlikely. I actually followed up on your indications and noticed
that each time I had a crash, a pointer that was supposedly aligned had
regressed by one. This reminded me of the NULL pointer that became -1.
I thought it was related to the pools since it often crashed there, and
in parallel Christopher looked for decrements in the SPOE part and found
that some NULL assignments were missing there on aborts.

> Christopher & Willy many thanks for your hard work!

Let me return the compliment! Two months of chasing a non-reproducible
memory corruption with zero initial info is quite an achievement, many
thanks for doing that!

> I'm always impressed
> how fast you're able to narrow the bug when you finally get proper input
> from a reporter. :)

It's very simple: the code is huge and any piece could be responsible
for any problem. Sometimes you have a good nose and manage to narrow
down the issue to an area. Sometimes you just read a piece of code and
figure it can do something nasty. Sometimes other reports come in and
help rule out other hypotheses. But when there's nothing logical, most
often it's a memory corruption and then there's no other solution than
being able to observe it live and heavily instrument the code to go back
in time from the crash to the cause. In your case we were lucky that
threads were not involved; otherwise this adds another dimension, and
very often the instrumentation code changes the timings and makes the
issue disappear :-)

Cheers,
Willy