westonpace opened a new pull request, #12894:
URL: https://github.com/apache/arrow/pull/12894
I identified and reproduced two possible ways this sort of segmentation
fault could happen. The stack traces demonstrated that worker tasks were still
running for a plan after the test case had considered the plan "finished" and
moved on.
First, the test case was calling gtest asserts in a helper method called
from a loop:
```cpp
void RunPlan(parameters) {
  Plan plan = MakePlan(parameters);
  ASSERT_TRUE(plan.FinishesInAReasonableTime());
}

void Test() {
  // ...
  for (int i = 0; i < kNumTrials; i++) {
    RunPlan(parameters);
  }
}
```
If the plan sometimes timed out, the assert would fire. However, a gtest
`ASSERT_*` only returns from the enclosing function, so control would fall back
into the loop and the next iteration would run anyway, potentially against a
plan that was still executing. I changed the helper method to return a `Result`
and moved all asserts into the test case itself. That being said, I don't think
this was the likely failure, as I would expect to have seen instances of this
test case timing out alongside the instances where it segfaulted.
The second possibility was an unusual set of circumstances that I was only able
to trigger reliably by inserting sleeps into the test at just the right spots.
Basically, the node has three task groups, `BuildHashTable`,
`ProbeQueuedBatches`, and `ScanHashTable`. It is possible for
`ProbeQueuedBatches` to have zero tasks. This means that when `StartTaskGroup`
is called on the probe task group, it finishes immediately and invokes the
finish continuation. The finish continuation could then call `StartTaskGroup`
on the scan hash table task group. If the scan task group finished quickly, it
could trigger the exec node's finished callback before the *probe* task group's
call to `StartTaskGroup` -> `OnTaskGroupFinished` had returned. That particular
call returned `all_task_groups_finished=false` (because it was the *probe* task
group, and the final task group was the scan task group), so it would then try
to call `this->ScheduleMore` while still inside `StartTaskGroup`, but by that
point `this` had been deleted. In fact, given the stack traces we have, it
looks like the call to `ScheduleMore` did start (which makes sense, as it
wasn't a virtual call), but the state of `this` was invalid.
I spent some time trying to figure out how to fix `TaskScheduler` before
realizing we already have a convenient fix for this problem. I added an
`AsyncTaskGroup` at the node level to ensure that all thread tasks started by
the node finish before the node is marked finished.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]