Hi,

On 2020-01-14 23:27:02 -0800, Andres Freund wrote:
> On 2020-01-14 17:54:16 -0500, Tom Lane wrote:
> > Andres Freund <and...@anarazel.de> writes:
> > > I'm still not sure I actually fully understand the bug. It's obvious how
> > > returning the input value again could lead to memory not being freed (so
> > > that leak seems to go all the way back). And similarly, since the
> > > introduction of expanded objects, it can also lead to the expanded
> > > object not being deleted.
> > > But that's not the problem causing the crash here. What I think must
> > > instead be the problem is that pergroupstate->transValueIsNull, but
> > > pergroupstate->transValue is set to something looking like a
> > > pointer. Which caused us not to datumCopy() a new transition value into
> > > a long lived context. and then a later transition causes us to free the
> > > short-lived value?
> >
> > Yeah, I was kind of wondering that too.  While formally the Datum value
> > for a null is undefined, I'm not aware offhand of any functions that
> > wouldn't return zero --- and this would have to be an aggregate transition
> > function doing so, which reduces the universe of candidates quite a lot.
> > Plus there's the question of how often a transition function would return
> > null for non-null input at all.
> >
> > Could we see a test case that provokes this crash, even if it doesn't
> > do so reliably?
>
> There's a larger reproducer referenced in the first message. I had hoped
> that Teodor could narrow it down - I guess I'll try to do that tomorrow...

FWIW, I'm working on narrowing it down to something small. I can
reliably trigger the bug, and I understand the mechanics, I
think. Interestingly enough the reproducer currently only triggers on
v12, not on v11 and before.

As you say, this requires a transition function returning a NULL that
has the datum part set - the reproducer here defines a non-strict
aggregate transition function that can indirectly do so:

CREATE FUNCTION public.state_max_bytea(st bytea, inp bytea) RETURNS bytea
    LANGUAGE plpgsql
    AS $$
    BEGIN
         if st is null
         then
            return inp;
         elseif st<inp then
            return inp;
         else
            return st;
         end if;
    END;$$;

CREATE AGGREGATE public.max(bytea) (
    SFUNC = public.state_max_bytea,
    STYPE = bytea
);

I.e. when the current transition is null (e.g. for the first tuple), the
transition is always set to new input value. Even if that is null.

Then the question in turn is, how the input datum is != 0, but has
isnull set. And that's caused by:


                EEO_CASE(EEOP_FUNCEXPR_STRICT)
                {
                        FunctionCallInfo fcinfo = op->d.func.fcinfo_data;
                        NullableDatum *args = fcinfo->args;
                        int                     argno;
                        Datum           d;

                        /* strict function, so check for NULL args */
                        for (argno = 0; argno < op->d.func.nargs; argno++)
                        {
                                if (args[argno].isnull)
                                {
                                        *op->resnull = true;
                                        goto strictfail;
                                }
                        }
                        fcinfo->isnull = false;
                        d = op->d.func.fn_addr(fcinfo);
                        *op->resvalue = d;
                        *op->resnull = fcinfo->isnull;

        strictfail:
                        EEO_NEXT();
                }


I.e. if the transitions argument is a strict function, and that strict
function is not evaluated because of a NULL input, we set op->resnull =
true, but do *not* touch op->resvalue. If there was a previous row that
actually set resvalue to something meaningful, we get an input to the
transition function consisting out of the old resvalue (!= 0), but the
new resnull = true.  If the transition function returns that unchanged,
ExecAggTransReparent() doesn't do anything, because the new value is
null. Afterwards pergroup->transValue is set != 0, even though
transValueIsNull = true.

The somewhat tricky bit is arranging this to happen with pointers that
are the same. I think I'm on the way to narrow that down, but it'll take
me a bit longer.

To fix this I think we should set newVal = 0 in
ExecAggTransReparent()'s, as a new else to !newValueIsNull. That should
not add any additional branches, I think. I contrast to always doing so
when checking whether ExecAggTransReparent() ought to be called.

Greetings,

Andres Freund


Reply via email to