Hi, On 2020-01-14 23:27:02 -0800, Andres Freund wrote: > On 2020-01-14 17:54:16 -0500, Tom Lane wrote: > > Andres Freund <and...@anarazel.de> writes: > > > I'm still not sure I actually fully understand the bug. It's obvious how > > > returning the input value again could lead to memory not being freed (so > > > that leak seems to go all the way back). And similarly, since the > > > introduction of expanded objects, it can also lead to the expanded > > > object not being deleted. > > > But that's not the problem causing the crash here. What I think must > > > instead be the problem is that pergroupstate->transValueIsNull, but > > > pergroupstate->transValue is set to something looking like a > > > pointer. Which caused us not to datumCopy() a new transition value into > > > a long lived context. and then a later transition causes us to free the > > > short-lived value? > > > > Yeah, I was kind of wondering that too. While formally the Datum value > > for a null is undefined, I'm not aware offhand of any functions that > > wouldn't return zero --- and this would have to be an aggregate transition > > function doing so, which reduces the universe of candidates quite a lot. > > Plus there's the question of how often a transition function would return > > null for non-null input at all. > > > > Could we see a test case that provokes this crash, even if it doesn't > > do so reliably? > > There's a larger reproducer referenced in the first message. I had hoped > that Teodor could narrow it down - I guess I'll try to do that tomorrow...
FWIW, I'm working on narrowing it down to something small. I can reliably trigger the bug, and I understand the mechanics, I think. Interestingly enough the reproducer currently only triggers on v12, not on v11 and before. As you say, this requires a transition function returning a NULL that has the datum part set - the reproducer here defines a non-strict aggregate transition function that can indirectly do so: CREATE FUNCTION public.state_max_bytea(st bytea, inp bytea) RETURNS bytea LANGUAGE plpgsql AS $$ BEGIN if st is null then return inp; elseif st<inp then return inp; else return st; end if; END;$$; CREATE AGGREGATE public.max(bytea) ( SFUNC = public.state_max_bytea, STYPE = bytea ); I.e. when the current transition is null (e.g. for the first tuple), the transition is always set to new input value. Even if that is null. Then the question in turn is, how the input datum is != 0, but has isnull set. And that's caused by: EEO_CASE(EEOP_FUNCEXPR_STRICT) { FunctionCallInfo fcinfo = op->d.func.fcinfo_data; NullableDatum *args = fcinfo->args; int argno; Datum d; /* strict function, so check for NULL args */ for (argno = 0; argno < op->d.func.nargs; argno++) { if (args[argno].isnull) { *op->resnull = true; goto strictfail; } } fcinfo->isnull = false; d = op->d.func.fn_addr(fcinfo); *op->resvalue = d; *op->resnull = fcinfo->isnull; strictfail: EEO_NEXT(); } I.e. if the transitions argument is a strict function, and that strict function is not evaluated because of a NULL input, we set op->resnull = true, but do *not* touch op->resvalue. If there was a previous row that actually set resvalue to something meaningful, we get an input to the transition function consisting out of the old resvalue (!= 0), but the new resnull = true. If the transition function returns that unchanged, ExecAggTransReparent() doesn't do anything, because the new value is null. Afterwards pergroup->transValue is set != 0, even though transValueIsNull = true. The somewhat tricky bit is arranging this to happen with pointers that are the same. I think I'm on the way to narrow that down, but it'll take me a bit longer. To fix this I think we should set newVal = 0 in ExecAggTransReparent()'s, as a new else to !newValueIsNull. That should not add any additional branches, I think. I contrast to always doing so when checking whether ExecAggTransReparent() ought to be called. Greetings, Andres Freund