Deneche A. Hakim created DRILL-4595:
---------------------------------------

             Summary: FragmentExecutor.fail() should interrupt the fragment thread to avoid possible query hangs
                 Key: DRILL-4595
                 URL: https://issues.apache.org/jira/browse/DRILL-4595
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.4.0
            Reporter: Deneche A. Hakim
            Assignee: Deneche A. Hakim
             Fix For: 1.7.0


When a fragment fails, it is assumed that it will be able to close itself and
send its FAILED state to the foreman, which will then cancel any running
fragments. FragmentExecutor.cancel() interrupts the fragment thread, making
sure those fragments don't stay blocked.
However, if a fragment is already blocked when its fail() method is called, the
foreman may never be notified of the failure and the query will hang forever.
One such scenario is the following:

- generally it's a CTAS running on a large cluster (lots of writers running in 
parallel)
- logs show that the user channel was closed and UserServer caused the root 
fragment to move to a FAILED state
- jstack shows that the root fragment is blocked in its receiver, waiting for 
data
- jstack also shows that ALL other fragments are no longer running, and the 
logs show that all of them succeeded
- the foreman waits *forever* for the root fragment to finish
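
The hang described above can be reproduced in miniature. The following is only a sketch of the idea, not Drill's actual code: FailInterruptSketch, SketchFragment, reportedState and the interruptOnFail flag are hypothetical names, and a BlockingQueue stands in for the receiver the root fragment is blocked on. With the flag off, fail() only records the FAILED state and the blocked thread never reports it; with the flag on, fail() also interrupts the thread, which wakes up and reports its state.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

public class FailInterruptSketch {
    public enum State { RUNNING, FAILED, FINISHED }

    /** Hypothetical stand-in for a fragment; not Drill's FragmentExecutor. */
    public static class SketchFragment implements Runnable {
        private final BlockingQueue<String> incoming = new LinkedBlockingQueue<>();
        private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        private final AtomicReference<Thread> runner = new AtomicReference<>();
        private final boolean interruptOnFail;
        // What the "foreman" would eventually be told; null means never notified.
        public volatile State reportedState;

        public SketchFragment(boolean interruptOnFail) {
            this.interruptOnFail = interruptOnFail;
        }

        @Override
        public void run() {
            runner.set(Thread.currentThread());
            try {
                incoming.take(); // blocks like a receiver waiting for data
                state.compareAndSet(State.RUNNING, State.FINISHED);
            } catch (InterruptedException e) {
                // Woken up by fail(); fall through and report the current state.
            } finally {
                reportedState = state.get();
            }
        }

        /** Called from another thread, e.g. when the user channel is closed. */
        public void fail() {
            state.set(State.FAILED);
            if (interruptOnFail) {
                Thread t = runner.get();
                if (t != null) {
                    t.interrupt(); // the proposed change: unblock the fragment thread
                }
            }
        }
    }

    /** Returns true if the blocked fragment reported FAILED shortly after fail(). */
    public static boolean failedStateReported(boolean interruptOnFail)
            throws InterruptedException {
        SketchFragment fragment = new SketchFragment(interruptOnFail);
        Thread t = new Thread(fragment, "sketch-fragment");
        t.setDaemon(true); // let the JVM exit even if the thread stays blocked
        t.start();
        // Wait until the fragment thread is parked inside take().
        while (t.getState() != Thread.State.WAITING) {
            Thread.sleep(10);
        }
        fragment.fail();
        t.join(500); // bounded wait instead of waiting forever like the foreman
        return fragment.reportedState == State.FAILED;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("without interrupt: " + failedStateReported(false));
        System.out.println("with interrupt:    " + failedStateReported(true));
    }
}
```

In the sketch the caller gives up after a bounded join, whereas the foreman in the scenario above has no such timeout, which is why the query hangs rather than fails.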



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
