Hi everyone,
Yes the current code is a bit difficult to reason about due to the fact
that we rely on nodes marking their finished_ future as finished at the
proper times. The various ExecNodes were further inconsistent about when
they created their futures and where, and sometimes they'd overwrite their
old ones, which makes it even harder to reason about.

Regarding the specific issue of StartAndCollect, there is a race that can
happen if you don't wait for the plan's finished_ future: StartAndCollect
returns a future that's a combination of the outputting AsyncGenerator and
the ExecPlan's finished future, and this future is marked done immediately
if either of them is marked finished. What can happen in the case of an
error is that the sink node will mark its output as finished with an error
status, which will cause the StartAndCollect future to be marked finished.
This races with the plan's finished future. If do StartAndCollect and
immediately exit the scope, the ExecPlan's finished future may still be
unfinished, which will trigger the assert. To avoid this problem, always
wait on the plan's finished future before exiting the scope (see [1]).

I just landed a PR[2] today that simplifies a lot of this stuff, and I have
a followup PR in the works that'll further simplify this whole thing. This
followup PR will help fix the error propagation and also fix this bug with
StartAndCollect.

Sasha

[1]
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/exec/hash_join_node_test.cc#L1747
[2] https://github.com/apache/arrow/pull/13143


On Fri, Jul 15, 2022 at 12:02 AM Yaron Gvili <rt...@hotmail.com> wrote:

> I ran into similar issues where a bug in a node's code led to an error
> that caused difficult-to-debug hangs or crashes during execution. I think a
> common problem with diagnosing such issues is that error messages (within
> Status instances) during execution do not always get communicated. Perhaps
> it would be useful to add a kind of debug-flag in Acero that causes these
> error messages during execution to at least be printed.
>
>
> Yaron.
> ________________________________
> From: Weston Pace <weston.p...@gmail.com>
> Sent: Thursday, July 14, 2022 12:38 PM
> To: dev@arrow.apache.org <dev@arrow.apache.org>
> Subject: Re: cpp: Debugging 'plan destruction before finishing'
>
> > After some quick debugging, I found that the asof node's StopProducing (a
> conditioning necessary to finish the plan) is called shortly after the
> error output.
>
> StopProducing should probably more accurately be named "Abort" or
> "StopRightNow".  If you run the plan to completion normally I do not
> believe you should see this getting called.
>
> > What cases would cause the plan to destruct before its nodes finish?
>
> This may be a chicken/egg problem but destroying a plan before it has
> finished will cause this (the destructor panics and, in a probably
> futile attempt, calls StopProducing in hopes it can stop the ongoing
> plan before a segmentation fault since any ongoing task is going to
> assume the plan is still alive and valid).
>
> StartAndCollect returns a future.  Are you sure you are keeping the
> exec plan alive / in scope until that future completes?  Can you share
> the code that is calling StartAndCollect?
>
> An error that is unhandled and reaches a sink (since there are no
> nodes that "handle errors" today this means any error) will also
> trigger a call to StopProducing.  So if the AsofJoinNode is calling
> ErrorReceived on its output then that would be a potential cause.  You
> can probably check for this condition with a debugger.
>
> On Thu, Jul 14, 2022 at 9:07 AM Ivan Chau <ivan.m.c...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I've been encountering a "plan destruction before finishing" output
> > occurring with the AsOfJoin node, particularly when joining large tables.
> >
> > My execution context is configured with the default memory pool and a
> > nullptr for the executor. I am calling StartAndCollect
> > <
> https://github.com/apache/arrow/blob/6cc37cf2d1ba72c46b64fbc7ac499bd0d7296d20/cpp/src/arrow/compute/exec/test_util.cc#L183-L197
> >
> > to execute the plan.
> >
> > After some quick debugging, I found that the asof node's StopProducing (a
> > conditioning necessary to finish the plan) is called shortly after the
> > error output.
> >
> > What cases would cause the plan to destruct before its nodes finish?
>

Reply via email to