[
https://issues.apache.org/jira/browse/DRILL-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232891#comment-15232891
]
Sudheesh Katkam commented on DRILL-4595:
----------------------------------------
[copying my comment from dev list thread]
In any scenario where a thread other than the fragment executor is failing (or
cancelling) a fragment, that thread should change the state, and then interrupt
the fragment executor.
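
A minimal sketch of that ordering, for illustration only. The class below is a hypothetical stand-in, not Drill's actual FragmentExecutor: the class name, the State enum, the `incoming` queue (standing in for the receiver a fragment blocks on), and the no-argument fail() are all assumptions made for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a fragment executor; not Drill's actual class.
class SketchFragmentExecutor implements Runnable {
  enum State { RUNNING, FINISHED, FAILED }

  private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
  private final AtomicReference<Thread> myThread = new AtomicReference<>();
  // Stands in for the receiver the fragment blocks on while waiting for data.
  private final BlockingQueue<String> incoming = new LinkedBlockingQueue<>();

  @Override
  public void run() {
    // Publish the thread before checking the state, so a concurrent fail()
    // either sees the thread (and interrupts it) or is seen by the loop check.
    myThread.set(Thread.currentThread());
    try {
      while (state.get() == State.RUNNING) {
        incoming.take(); // blocks until data arrives or the thread is interrupted
      }
    } catch (InterruptedException e) {
      // Woken up by fail(); the caller has already changed the state.
    } finally {
      myThread.set(null);
      state.compareAndSet(State.RUNNING, State.FINISHED);
      // A real executor would report its final state to the foreman here.
    }
  }

  // Meant to be called from a thread other than the fragment thread.
  void fail() {
    // 1. Change the state first ...
    if (state.compareAndSet(State.RUNNING, State.FAILED)) {
      // 2. ... then interrupt the fragment thread so it cannot stay blocked.
      Thread t = myThread.get();
      if (t != null) {
        t.interrupt();
      }
    }
  }

  State state() {
    return state.get();
  }
}
```

Changing the state before interrupting means that when the fragment thread wakes up, it already observes FAILED and can proceed to clean up and report, rather than treating the interrupt as unexplained.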
> FragmentExecutor.fail() should interrupt the fragment thread to avoid
> possible query hangs
> ------------------------------------------------------------------------------------------
>
> Key: DRILL-4595
> URL: https://issues.apache.org/jira/browse/DRILL-4595
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.4.0
> Reporter: Deneche A. Hakim
> Assignee: Deneche A. Hakim
> Fix For: 1.7.0
>
>
> When a fragment fails, it is assumed that it will be able to close itself and
> send its FAILED state to the foreman, which will cancel any running fragments.
> FragmentExecutor.cancel() will interrupt the thread, making sure those
> fragments don't stay blocked.
> However, if a fragment is already blocked when its fail() method is called,
> the foreman may never be notified and the query will hang forever. One
> such scenario is the following:
> - generally it's a CTAS running on a large cluster (lots of writers running
> in parallel)
> - logs show that the user channel was closed and UserServer caused the root
> fragment to move to a FAILED state
> - jstack shows that the root fragment is blocked in its receiver waiting for
> data
> - jstack also shows that ALL other fragments are no longer running, and the
> logs show that all of them succeeded
> - the foreman waits *forever* for the root fragment to finish