[ https://issues.apache.org/jira/browse/DRILL-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544569#comment-14544569 ]
Chris Westin commented on DRILL-3030:
-------------------------------------
The Foreman has noticed (probably via the FragmentStateListener) that the
fragment running on the node you killed is gone, so it is now trying to cancel
all the other fragments as part of its cleanup. I wonder whether it's blocked
trying to communicate with the dead node. I'll check whether we exclude that
node from the list of fragments to be cancelled. In any case, we should issue
this call with a timeout, because even if that turns out not to be the cause,
any of the target nodes could go down in the meantime anyway. That's probably a
better solution than trying to weed out the failed fragment from the set of
cancellations being sent.
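
To make the timeout idea concrete, here is a minimal sketch in plain Java. It
is not Drill code: the class BoundedCancel, the method cancelAll, the
sendCancel callback and the string endpoint names are all hypothetical, and the
sketch assumes the send itself returns a future immediately rather than
blocking the way runCommand does in the attached stack.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical sketch; none of these names are Drill APIs.
final class BoundedCancel {

    /**
     * Sends a cancel request to every endpoint, then waits at most timeoutMs
     * in total for the acknowledgements. Returns the endpoints that failed or
     * did not acknowledge before the deadline.
     *
     * Assumption: sendCancel is non-blocking and returns a future right away.
     */
    static List<String> cancelAll(List<String> endpoints,
                                  Function<String, CompletableFuture<Void>> sendCancel,
                                  long timeoutMs) throws InterruptedException {
        // Fire all requests first so one dead node cannot delay the others.
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (String endpoint : endpoints) {
            pending.add(sendCancel.apply(endpoint));
        }

        // One shared deadline bounds the total wait, not the wait per node.
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        List<String> unacknowledged = new ArrayList<>();
        for (int i = 0; i < endpoints.size(); i++) {
            long remaining = Math.max(0L, deadline - System.nanoTime());
            try {
                pending.get(i).get(remaining, TimeUnit.NANOSECONDS);
            } catch (TimeoutException | ExecutionException e) {
                // Dead or failing node: record it and keep going.
                unacknowledged.add(endpoints.get(i));
            }
        }
        return unacknowledged;
    }
}
```

With something along these lines the cleanup path never waits longer than the
timeout, whether or not the dead node was filtered out of the list first, and
the caller can still log or retry whichever endpoints come back
unacknowledged.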
> Foreman seems to be unable to cancel itself
> -------------------------------------------
>
> Key: DRILL-3030
> URL: https://issues.apache.org/jira/browse/DRILL-3030
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow
> Affects Versions: 1.0.0
> Reporter: Ramana Inukonda Nagaraj
> Assignee: Chris Westin
> Attachments: threadstack
>
>
> Steps to repro:
> 1. Ran a long-running query on a clean Drill restart.
> 2. Killed a non-foreman node.
> 3. Restarted drillbits using clush.
> One of the drillbits (coincidentally, always a foreman node) refused to
> shut down.
> Jstack shows that the foreman is waiting:
> {code}
> at org.apache.drill.exec.rpc.ReconnectingConnection$ConnectionListeningFuture.waitAndRun(ReconnectingConnection.java:105)
> at org.apache.drill.exec.rpc.ReconnectingConnection.runCommand(ReconnectingConnection.java:81)
> - locked <0x000000073878aaa8> (a org.apache.drill.exec.rpc.control.ControlConnectionManager)
> at org.apache.drill.exec.rpc.control.ControlTunnel.cancelFragment(ControlTunnel.java:57)
> at org.apache.drill.exec.work.foreman.QueryManager.cancelExecutingFragments(QueryManager.java:192)
> at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:824)
> at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:768)
> at org.apache.drill.common.EventProcessor.sendEvent(EventProcessor.java:73)
> at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:770)
> at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:871)
> at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:107)
> at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1132)
> at org.apache.drill.exec.work.foreman.QueryManager$1.statusUpdate(QueryManager.java:460)
> {code}