[ 
https://issues.apache.org/jira/browse/DRILL-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544569#comment-14544569
 ] 

Chris Westin commented on DRILL-3030:
-------------------------------------

The Foreman has noticed (probably via the FragmentStateListener) that the
fragment running on the node you killed is gone, so it is now trying to cancel
all the other fragments as part of its cleanup. I wonder if it's blocked trying
to communicate with the dead node. I'll check whether we exclude that node's
fragment from the list of fragments to be cancelled. Either way, we should
issue this call with a timeout: even if the dead node is already excluded, any
of the other target nodes could go down in the meantime. That's probably a
better solution than trying to weed out the failed fragment from the set of
cancellations being sent.
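
To make the timeout idea concrete, here is a minimal, self-contained sketch in
plain java.util.concurrent. It is not Drill code: BoundedCancel, cancelAll and
fakeSend are hypothetical names, endpoints are plain strings, and the
"send cancel" step is a stand-in for whatever the control tunnel actually does.
It only shows the shape of the fix: fire every cancel request, then wait for
the acknowledgements against one shared deadline instead of blocking forever.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical illustration only; none of these names exist in Drill.
public class BoundedCancel {

    /**
     * Sends a cancel request to every endpoint, then waits for the
     * acknowledgements until a shared deadline expires. Returns the endpoints
     * that failed or did not answer in time, so cleanup can continue.
     */
    public static List<String> cancelAll(List<String> endpoints,
                                         Function<String, CompletableFuture<Void>> sendCancel,
                                         long timeoutMillis) {
        // Fire all requests up front so one dead node cannot delay the others.
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (String endpoint : endpoints) {
            pending.add(sendCancel.apply(endpoint));
        }

        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        List<String> unacknowledged = new ArrayList<>();
        for (int i = 0; i < endpoints.size(); i++) {
            long remaining = Math.max(0L, deadline - System.nanoTime());
            try {
                pending.get(i).get(remaining, TimeUnit.NANOSECONDS);
            } catch (TimeoutException | ExecutionException e) {
                // Give up on this endpoint instead of blocking the caller.
                pending.get(i).cancel(true);
                unacknowledged.add(endpoints.get(i));
            } catch (InterruptedException e) {
                // Restore the interrupt flag and report the rest as unacknowledged.
                Thread.currentThread().interrupt();
                unacknowledged.addAll(endpoints.subList(i, endpoints.size()));
                break;
            }
        }
        return unacknowledged;
    }

    /** Stand-in transport: "dead-node" never answers, everything else acks at once. */
    public static CompletableFuture<Void> fakeSend(String endpoint) {
        if (endpoint.equals("dead-node")) {
            return new CompletableFuture<Void>(); // never completed
        }
        return CompletableFuture.<Void>completedFuture(null);
    }

    public static void main(String[] args) {
        List<String> failed = cancelAll(
                Arrays.asList("node-a", "dead-node", "node-b"),
                BoundedCancel::fakeSend,
                200);
        System.out.println("No acknowledgement from: " + failed);
    }
}
```

With the stand-in transport above, main prints
"No acknowledgement from: [dead-node]" after roughly 200 ms rather than hanging.
Using a single deadline keeps the total wait bounded no matter how many nodes
are unresponsive; a per-node timeout would also work, but its worst case grows
with the number of dead nodes.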

> Foreman seems to be unable to cancel itself
> -------------------------------------------
>
>                 Key: DRILL-3030
>                 URL: https://issues.apache.org/jira/browse/DRILL-3030
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.0.0
>            Reporter: Ramana Inukonda Nagaraj
>            Assignee: Chris Westin
>         Attachments: threadstack
>
>
> Steps to repro:
> 1. Ran a long-running query on a clean Drill restart.
> 2. Killed a non-foreman node.
> 3. Restarted the drillbits using clush.
> One of the drillbits (coincidentally, always a foreman node) refused to shut
> down.
> Jstack shows that the foreman is waiting:
> {code}
>         at org.apache.drill.exec.rpc.ReconnectingConnection$ConnectionListeningFuture.waitAndRun(ReconnectingConnection.java:105)
>         at org.apache.drill.exec.rpc.ReconnectingConnection.runCommand(ReconnectingConnection.java:81)
>         - locked <0x000000073878aaa8> (a org.apache.drill.exec.rpc.control.ControlConnectionManager)
>         at org.apache.drill.exec.rpc.control.ControlTunnel.cancelFragment(ControlTunnel.java:57)
>         at org.apache.drill.exec.work.foreman.QueryManager.cancelExecutingFragments(QueryManager.java:192)
>         at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:824)
>         at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:768)
>         at org.apache.drill.common.EventProcessor.sendEvent(EventProcessor.java:73)
>         at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:770)
>         at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:871)
>         at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:107)
>         at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1132)
>         at org.apache.drill.exec.work.foreman.QueryManager$1.statusUpdate(QueryManager.java:460)
> {code}



