[ https://issues.apache.org/jira/browse/DRILL-5840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197476#comment-16197476 ]

Robert Hou commented on DRILL-5840:
-----------------------------------

The cluster ran out of disk space, which caused problems for ZooKeeper.

From /opt/mapr/zookeeper/zookeeper-3.4.5/logs:

{noformat}
2017-10-04 16:22:54,475 [myid:] - ERROR [SyncThread:0:SyncRequestProcessor@151] - Severe unrecoverable error, exiting
java.io.IOException: No space left on device
        at jav

Stopping zookeeper ... STOPPED
Starting zookeeper ... 2017-10-05 15:07:54,495 [myid:] - INFO  [main:QuorumPeerConfig@101] - Reading configuration from: /opt/mapr/zookeeper/zookeeper-3.4.5/conf/zoo.cfg
{noformat}
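
To make this kind of failure easier to confirm, a small stand-alone check along the lines below reports free space under the ZooKeeper data directory and then probes connectivity with the same 10-second window the drillbit startup failure below reports. This is only a sketch; the class name, data-directory path, and connect string are illustrative assumptions, not Drill code:

{noformat}
import java.io.File;
import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.RetryNTimes;

// Illustrative stand-alone probe, not part of Drill.
public class ZkHealthProbe {
  public static void main(String[] args) throws Exception {
    // Both defaults are assumptions; take dataDir from zoo.cfg and the
    // connect string from the cluster configuration.
    File dataDir = new File(args.length > 0
        ? args[0] : "/opt/mapr/zookeeper/zookeeper-3.4.5/data");
    String connectString = args.length > 1 ? args[1] : "localhost:5181";

    // 1. Disk space under the ZooKeeper data directory (the ENOSPC trigger above).
    System.out.println("Usable space under " + dataDir + ": "
        + dataDir.getUsableSpace() / (1024 * 1024) + " MB");

    // 2. Connectivity probe, waiting the same 10 seconds the drillbit allows
    //    before giving up on the ZooKeeper cluster at startup.
    CuratorFramework client =
        CuratorFrameworkFactory.newClient(connectString, new RetryNTimes(3, 1000));
    client.start();
    try {
      if (client.blockUntilConnected(10, TimeUnit.SECONDS)) {
        // A cheap read confirms the session works, not just the TCP connect.
        System.out.println("ZooKeeper reachable, root stat: "
            + client.checkExists().forPath("/"));
      } else {
        System.out.println("Could not connect to ZooKeeper within 10 seconds");
      }
    } finally {
      client.close();
    }
  }
}
{noformat}

With ZooKeeper already stopped on the ENOSPC error, such a probe keeps failing until disk space is freed and the service is restarted, which matches the Stopping/Starting lines above.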


> A query that includes sort completes, and then loses Drill connection. Drill 
> becomes unresponsive, and cannot restart because it cannot communicate with 
> Zookeeper
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5840
>                 URL: https://issues.apache.org/jira/browse/DRILL-5840
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.11.0
>            Reporter: Robert Hou
>            Assignee: Paul Rogers
>             Fix For: 1.12.0
>
>
> Query is:
> {noformat}
> ALTER SESSION SET `exec.sort.disable_managed` = false;
> select count(*) from (select * from 
> dfs.`/drill/testdata/resource-manager/250wide.tbl` order by columns[0])d 
> where d.columns[0] = 'ljdfhwuehnoiueyf';
> {noformat}
> The query tries to complete but cannot. Twenty hours elapse from the time the 
> query tries to complete to the time Drill finally loses its connection.
> From the drillbit.log:
> {noformat}
> 2017-10-03 16:28:14,892 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG 
> o.a.drill.exec.work.foreman.Foreman - 262bec7f-3539-0dd7-6fea-f2959f9df3b6: 
> State change requested RUNNING --> COMPLETED
> 2017-10-04 01:47:27,698 [UserServer-1] DEBUG 
> o.a.d.e.r.u.UserServerRequestHandler - Received query to run.  Returning 
> query handle.
> 2017-10-04 03:30:02,916 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] WARN  
> o.a.d.exec.work.foreman.QueryManager - Failure while trying to delete the 
> estore profile for this query.
> org.apache.drill.common.exceptions.DrillRuntimeException: unable to delete 
> node at /running/262bec7f-3539-0dd7-6fea-f2959f9df3b6
>       at 
> org.apache.drill.exec.coord.zk.ZookeeperClient.delete(ZookeeperClient.java:343)
>  ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.coord.zk.ZkEphemeralStore.remove(ZkEphemeralStore.java:108)
>  ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.QueryManager.updateEphemeralState(QueryManager.java:293)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.Foreman.recordNewState(Foreman.java:1043) 
> [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:964) 
> [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.Foreman.access$2600(Foreman.java:113) 
> [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:1025)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:1018)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.common.EventProcessor.processEvents(EventProcessor.java:107) 
> [drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.common.EventProcessor.sendEvent(EventProcessor.java:65) 
> [drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.Foreman$StateSwitch.addEvent(Foreman.java:1020)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.Foreman.addToEventQueue(Foreman.java:1038) 
> [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.QueryManager.nodeComplete(QueryManager.java:498)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.QueryManager.access$100(QueryManager.java:66)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.QueryManager$NodeTracker.fragmentComplete(QueryManager.java:462)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.QueryManager.fragmentDone(QueryManager.java:147)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.QueryManager.access$400(QueryManager.java:66)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.QueryManager$1.statusUpdate(QueryManager.java:525)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.rpc.control.WorkEventBus.statusUpdate(WorkEventBus.java:71)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.fragment.FragmentStatusReporter.sendStatus(FragmentStatusReporter.java:124)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.fragment.FragmentStatusReporter.stateChanged(FragmentStatusReporter.java:94)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:304)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:267)
>  [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_111]
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_111]
>       at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
> Caused by: org.apache.curator.CuratorConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
>       at 
> org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) 
> ~[curator-client-2.7.1.jar:na]
>       at 
> org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) 
> ~[curator-client-2.7.1.jar:na]
>       at 
> org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
>  ~[curator-client-2.7.1.jar:na]
>       at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477)
>  ~[curator-framework-2.7.1.jar:na]
>       at 
> org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:238)
>  ~[curator-framework-2.7.1.jar:na]
>       at 
> org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:233)
>  ~[curator-framework-2.7.1.jar:na]
>       at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) 
> ~[curator-client-2.7.1.jar:na]
>       at 
> org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:230)
>  ~[curator-framework-2.7.1.jar:na]
>       at 
> org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:214)
>  ~[curator-framework-2.7.1.jar:na]
>       at 
> org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:41)
>  ~[curator-framework-2.7.1.jar:na]
>       at 
> org.apache.drill.exec.coord.zk.ZookeeperClient.delete(ZookeeperClient.java:340)
>  ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       ... 27 common frames omitted
> 2017-10-04 03:30:02,916 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG 
> o.a.drill.exec.work.foreman.Foreman - 262bec7f-3539-0dd7-6fea-f2959f9df3b6: 
> cleaning up.
> 2017-10-04 03:30:02,916 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG 
> o.a.d.exec.rpc.control.WorkEventBus - Removing fragment status listener for 
> queryId 262bec7f-3539-0dd7-6fea-f2959f9df3b6.
> 2017-10-04 03:30:02,970 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG 
> o.a.drill.exec.work.foreman.Foreman - 262bec7f-3539-0dd7-6fea-f2959f9df3b6: 
> State change requested COMPLETED --> CANCELLATION_REQUESTED
> 2017-10-04 03:30:02,970 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] WARN  
> o.a.drill.exec.work.foreman.Foreman - Dropping request to move to 
> CANCELLATION_REQUESTED state as query is already at COMPLETED state (which is 
> terminal).
> 2017-10-04 03:30:02,970 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG 
> o.a.d.e.w.f.FragmentStatusReporter - Closing 
> org.apache.drill.exec.work.fragment.FragmentStatusReporter@75e9a886
> 2017-10-04 03:30:02,970 [drill-executor-42] DEBUG 
> o.a.d.exec.rpc.control.WorkEventBus - Removing fragment manager: 
> 262bec7f-3539-0dd7-6fea-f2959f9df3b6:0:0
> 2017-10-04 12:49:47,660 [UserServer-1] WARN  
> o.a.drill.exec.rpc.user.UserServer - Message of mode REQUEST of rpc type 3 
> took longer than 500ms.  Actual duration was 39739962ms.
> 2017-10-04 12:49:47,662 [UserServer-1] ERROR 
> o.a.d.exec.rpc.RpcExceptionHandler - Exception in RPC communication.  
> Connection: /10.10.100.190:31010 <--> /10.10.100.190:37568 (user server).  
> Closing connection.
> io.netty.handler.codec.DecoderException: 
> org.apache.drill.common.exceptions.DrillRuntimeException: unable to put 
>       at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:99)
>  [netty-codec-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.handler.timeout.ReadTimeoutHandler.channelRead(ReadTimeoutHandler.java:150)
>  [netty-handler-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>  [netty-codec-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)
>  [netty-codec-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>  [netty-common-4.0.27.Final.jar:4.0.27.Final]
>       at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
> Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: unable 
> to put 
>       at 
> org.apache.drill.exec.coord.zk.ZookeeperClient.put(ZookeeperClient.java:300) 
> ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.coord.zk.ZookeeperClient.put(ZookeeperClient.java:246) 
> ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.coord.zk.ZkEphemeralStore.put(ZkEphemeralStore.java:79) 
> ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.QueryManager.updateEphemeralState(QueryManager.java:285)
>  ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.foreman.Foreman.recordNewState(Foreman.java:1043) 
> ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at org.apache.drill.exec.work.foreman.Foreman.<init>(Foreman.java:180) 
> ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.work.user.UserWorker.submitWork(UserWorker.java:78) 
> ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.rpc.user.UserServerRequestHandler.handle(UserServerRequestHandler.java:66)
>  ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.rpc.user.UserServerRequestHandler.handle(UserServerRequestHandler.java:47)
>  ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at org.apache.drill.exec.rpc.BasicServer.handle(BasicServer.java:157) 
> ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at org.apache.drill.exec.rpc.BasicServer.handle(BasicServer.java:53) 
> ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:274) 
> ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:244) 
> ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
>  [netty-codec-4.0.27.Final.jar:4.0.27.Final]
>       ... 22 common frames omitted
> Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: error 
> while checking path on zookeeper
>       at 
> org.apache.drill.exec.coord.zk.ZookeeperClient.hasPath(ZookeeperClient.java:139)
>  ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.coord.zk.ZookeeperClient.hasPath(ZookeeperClient.java:107)
>  ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.coord.zk.ZookeeperClient.put(ZookeeperClient.java:273) 
> ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       ... 35 common frames omitted
> Caused by: org.apache.curator.CuratorConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
>       at 
> org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) 
> ~[curator-client-2.7.1.jar:na]
>       at 
> org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) 
> ~[curator-client-2.7.1.jar:na]
>       at 
> org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
>  ~[curator-client-2.7.1.jar:na]
>       at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477)
>  ~[curator-framework-2.7.1.jar:na]
>       at 
> org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:172)
>  ~[curator-framework-2.7.1.jar:na]
>       at 
> org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:161)
>  ~[curator-framework-2.7.1.jar:na]
>       at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) 
> ~[curator-client-2.7.1.jar:na]
>       at 
> org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:158)
>  ~[curator-framework-2.7.1.jar:na]
>       at 
> org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:148)
>  ~[curator-framework-2.7.1.jar:na]
>       at 
> org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:36)
>  ~[curator-framework-2.7.1.jar:na]
>       at 
> org.apache.drill.exec.coord.zk.ZookeeperClient.hasPath(ZookeeperClient.java:130)
>  ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       ... 37 common frames omitted
> 2017-10-04 12:49:47,667 [UserServer-1] INFO  
> o.a.drill.exec.rpc.user.UserServer - RPC connection /10.10.100.190:31010 <--> 
> /10.10.100.190:37579 (user server) timed out.  Timeout was set to 30 seconds. 
> Closing connection.
> 2017-10-04 12:49:47,668 [UserServer-1] INFO  
> o.a.drill.exec.work.foreman.Foreman - Failure while trying communicate query 
> result to initiating client. This would happen if a client is disconnected 
> before response notice can be sent.
> org.apache.drill.exec.rpc.ChannelClosedException: Channel closed 
> /10.10.100.190:31010 <--> /10.10.100.190:37579.
>       at 
> org.apache.drill.exec.rpc.RpcBus$ChannelClosedHandler.operationComplete(RpcBus.java:164)
>  [drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> org.apache.drill.exec.rpc.RpcBus$ChannelClosedHandler.operationComplete(RpcBus.java:144)
>  [drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
>  [netty-common-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
>  [netty-common-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
>  [netty-common-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:406) 
> [netty-common-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:943)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:592)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:584)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.close(DefaultChannelPipeline.java:1099)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:615)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:600)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.ChannelOutboundHandlerAdapter.close(ChannelOutboundHandlerAdapter.java:71)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:615)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:600)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:466)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.handler.timeout.ReadTimeoutHandler.readTimedOut(ReadTimeoutHandler.java:187)
>  [netty-handler-4.0.27.Final.jar:4.0.27.Final]
>       at 
> org.apache.drill.exec.rpc.BasicServer$LoggingReadTimeoutHandler.readTimedOut(BasicServer.java:122)
>  [drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>       at 
> io.netty.handler.timeout.ReadTimeoutHandler$ReadTimeoutTask.run(ReadTimeoutHandler.java:212)
>  [netty-handler-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
>  [netty-common-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
>  [netty-common-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>  [netty-common-4.0.27.Final.jar:4.0.27.Final]
>       at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) 
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>       at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>  [netty-common-4.0.27.Final.jar:4.0.27.Final]
>       at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
> {noformat}
> When I reconnect to Drill and try to show schemas, Drill hangs, and I see 
> this in drillbit.log:
> {noformat}
> 2017-10-04 13:01:01,414 [UserServer-1] DEBUG 
> o.a.d.e.r.u.UserServerRequestHandler - Received query to run.  Returning 
> query handle.
> {noformat}
> I try to stop Drill, but it doesn't stop, so it gets killed.
> {noformat}
> [root@qa-node190 logs.67]# /opt/drill/bin/drillbit.sh stop
> Stopping drillbit
> .........................................................................................................................
> drillbit did not complete after 120 seconds, killing with kill -9 14550
> {noformat}
> I try to start Drill
> {noformat}
> /opt/drill/bin/drillbit.sh start
> {noformat}
> But it fails because it cannot communicate with Zookeeper:
> {noformat}
> 2017-10-04 14:40:32,812 [main] DEBUG o.a.d.e.c.zk.ZKClusterCoordinator - 
> Starting ZKClusterCoordination.
> 2017-10-04 14:40:42,820 [main] ERROR o.apache.drill.exec.server.Drillbit - 
> Failure during initial startup of Drillbit.
> java.io.IOException: Failure to connect to the zookeeper cluster service 
> within the allotted time of 10000 milliseconds.
>         at 
> org.apache.drill.exec.coord.zk.ZKClusterCoordinator.start(ZKClusterCoordinator.java:123)
>  ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:153) 
> [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:343) 
> [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:316) 
> [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.main(Drillbit.java:312) 
> [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 2017-10-04 14:40:42,820 [main] DEBUG o.apache.drill.exec.server.Drillbit - 
> Shutdown begun.
> 2017-10-04 14:40:44,714 [main] WARN  o.apache.drill.exec.server.Drillbit - 
> Failure on close()
> java.lang.NullPointerException: null
>         at org.apache.drill.exec.work.WorkManager.close(WorkManager.java:152) 
> ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at 
> org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:76) 
> ~[drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at 
> org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:64) 
> ~[drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.close(Drillbit.java:199) 
> [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:346) 
> [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:316) 
> [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.main(Drillbit.java:312) 
> [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 2017-10-04 14:40:44,715 [main] INFO  o.apache.drill.exec.server.Drillbit - 
> Shutdown completed (1894 ms).
> {noformat}
> I normally restart Warden and ZooKeeper to resolve this.  This time, stopping 
> Warden took 20 minutes (it normally takes a few minutes or less):
> {noformat}
> [root@qa-node190 logs.67]# time service mapr-warden stop
> stopping WARDEN
> looking to stop mapr-core processes not started by warden
> real  20m46.637s
> user  0m1.622s
> sys   0m2.432s
> {noformat}
> Stopping zookeeper was quick, and restarting all processes including Drill 
> completed as expected.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
