westonpace opened a new pull request, #12894:
URL: https://github.com/apache/arrow/pull/12894
I identified and reproduced two possible ways this sort of segmentation
fault could happen. The stack traces demonstrated that worker tasks were still
running for a plan after the test case had considered the plan "finished" and
moved on.
First, the test case was calling gtest asserts in a helper method called
from a loop:
```cpp
void RunPlan(parameters) {
  Plan plan = MakePlan(parameters);
  ASSERT_TRUE(plan.FinishesInAReasonableTime());
}

void Test() {
  // ...
  for (int i = 0; i < kNumTrials; i++) {
    RunPlan(parameters);
  }
}
```
If the plan sometimes timed out, the assert would fire. However, a gtest
`ASSERT_*` only returns from the enclosing function, so control would fall back
into the loop and the next iteration would run anyway, potentially against a
plan that was still executing. I changed the helper method to return a `Result`
and moved all asserts into the test case itself. That being said, I don't think
this was the likely failure, as I would expect to have seen instances of this
test case timing out alongside the instances where it segfaulted.
The second possibility was an unusual set of circumstances that I was only able
to trigger reliably by inserting sleeps into the test at just the right spots.
Basically, the node has three task groups, `BuildHashTable`,
`ProbeQueuedBatches`, and `ScanHashTable`. It is possible for
`ProbeQueuedBatches` to have zero tasks. This means that when `StartTaskGroup`
is called on the probe task group, it finishes immediately and invokes the
finish continuation. The finish continuation could then call `StartTaskGroup`
on the scan hash table task group. If the scan task group finished quickly, it
could trigger the exec node's finished callback before the *probe* task group's
call to `StartTaskGroup` -> `OnTaskGroupFinished` had returned. That particular
call returned `all_task_groups_finished=false` (because it was the *probe* task
group, and the final task group was the scan task group), so it would then try
to call `this->ScheduleMore` while still inside `StartTaskGroup`, but by that
point `this` had been deleted. In fact, given the stack traces we have, it
looks like the call to `ScheduleMore` did start (which makes sense, as it
wasn't a virtual call), but the state of `this` was invalid.
I spent some time trying to figure out how to fix `TaskScheduler` before
realizing we already have a convenient fix for this problem. I added an
`AsyncTaskGroup` at the node level to ensure that all thread tasks started by
the node finish before the node is marked finished.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]