On Wed, Nov 11, 2020 at 12:43:50PM +0100, Maciej Zdeb wrote:
> Wow! Yes, I can confirm that a crash does not occur now. :) I checked 2.0
> and 2.2 branches. I'll keep testing it for a couple days just to be sure.
> 
> So that stacktrace I shared before (on spoe_release_appctx function) was
> very lucky... Do you think that it'd be possible to find the bug without
> the replication procedure?

It would have been very hard. I actually followed up on your indications
and noticed that each time I had a crash, a pointer that was supposed to
be aligned had been decremented by one. This reminded me of the NULL
pointer that became -1. I thought it was related to the pools since it
often crashed there, and in parallel Christopher looked for decrements in
the SPOE part and found that some NULL assignments were missing there on
aborts.
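
To illustrate the symptom described above, here is a minimal sketch (not
the actual SPOE code) of how a decrement through a stale reference can
turn NULL into -1 or push an aligned pointer off by one. The names
"union slot", "stale_release" and "is_plausible_ptr" are made up for the
example, and it assumes a common platform where NULL is all-bits-zero and
intptr_t has the same size as a pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool slot: its old owner saw a counter here, while the
 * new owner of the recycled memory stores a pointer in the same word. */
union slot {
    intptr_t count;   /* old owner's view */
    void    *ptr;     /* new owner's view */
};

/* Release path of the old owner. Its reference should have been reset
 * to NULL on abort; since it was not, it still decrements the word. */
void stale_release(union slot *stale)
{
    stale->count--;
}

/* Instrumentation-style check: a valid pointer is NULL or aligned on
 * the pointer size. Returns 1 if plausible, 0 if it looks corrupted. */
int is_plausible_ptr(const void *p)
{
    return ((uintptr_t)p % sizeof(void *)) == 0;
}

/* A NULL pointer hit by the stray decrement becomes -1 (all ones). */
uintptr_t corrupt_null(void)
{
    union slot s;

    s.ptr = NULL;
    stale_release(&s);
    return (uintptr_t)s.ptr;
}

/* An aligned pointer hit by the stray decrement is off by one and
 * fails the alignment check. */
int corrupt_aligned(void)
{
    static void *target;   /* any pointer-aligned object */
    union slot s;

    s.ptr = &target;
    stale_release(&s);
    return is_plausible_ptr(s.ptr);
}
```

A check like is_plausible_ptr() placed on the paths that read the pointer
is one way to catch such a corruption closer to its cause than the final
crash.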

> Christopher & Willy many thanks for your hard work!

Let me return the compliment! Two months of chasing a non-reproducible
memory corruption with zero initial info is quite an achievement, many
thanks for doing that!

> I'm always impressed
> how fast you're able to narrow the bug when you finally get proper input
> from a reporter. :)

It's very simple, the code is huge and any piece could be responsible for
any problem. Sometimes you have a good nose and manage to narrow the issue
down to one area. Sometimes you just read a piece of code and figure it
can do something nasty. Sometimes other reports come in and help rule out
other hypotheses. But when there's nothing logical, most often it's a memory
corruption and then there's no other solution than being able to observe it
live and heavily instrument the code to go back in time from the crash to
the cause. In your case we were lucky that threads were not involved,
otherwise they add another dimension, and very often the instrumentation
code changes the timings and makes the issue disappear :-)

Cheers,
Willy
