Hi, On 2020-01-15 12:47:47 -0800, Andres Freund wrote: > FWIW, I'm working on narrowing it down to something small. I can > reliably trigger the bug, and I understand the mechanics, I > think. Interestingly enough the reproducer currently only triggers on > v12, not on v11 and before.
That's just happenstance due to allocation changes in plpgsql, though. The attached small reproducer, for me, reliably triggers crashes on 10 - master. It's hard to hit intentionally, because plpgsql does a datumCopy() to its non-null return value, which means that to hit the bug, one needs different numbers of allocations between setting up the transition value with transvalueisnull = true, transvalue = 0xsomepointer (because plpgsql doesn't copy NULLs), and the transition output with transvalueisnull = false, transvalue = 0xsomepointer. Which is necessary to trigger the bug, as it's then not reparented into a long lived enough context. To be then freed/accessed for the next group input value. I think this is too finnicky to actually keep as a regression test. The bug, in a way, exists all the way back, but it's a bit harder to create NULL values where the datum component isn't 0. To fix I suggest we, in all branches, do the equivalent of adding something like: diff --git i/src/backend/executor/execExprInterp.c w/src/backend/executor/execExprInterp.c index 790380051be..3260a63ac6b 100644 --- i/src/backend/executor/execExprInterp.c +++ w/src/backend/executor/execExprInterp.c @@ -4199,6 +4199,12 @@ ExecAggTransReparent(AggState *aggstate, AggStatePerTrans pertrans, pertrans->transtypeByVal, pertrans->transtypeLen); } + else + { + /* ensure datum component is 0 for NULL transition values */ + newValue = (Datum) 0; + } + if (!oldValueIsNull) { if (DatumIsReadWriteExpandedObject(oldValue, and a comment explaining why it's (now) safe to rely on datum comparisons for if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue)) I don't think it makes sense to add something like it to the byval case - there's plenty other ways a function returning != 0 with fcinfo->isnull == true can cause such values to exist. And that's longstanding. A separate question is whether it's worth adding code to e.g. EEO_CASE(EEOP_FUNCEXPR_STRICT) also resetting *op->resvalue to (Datum) 0. I don't personally don't think ensuring the datum is always 0 when isnull true is all that helpful, if we can't guarantee it everywhere. So I'm a bit loathe to add cycles to places that don't need it, and are hot. Regards, Andres
xrepro.sql
Description: application/sql