[ https://issues.apache.org/jira/browse/DRILL-4057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003443#comment-15003443 ]
Jacques Nadeau commented on DRILL-4057: --------------------------------------- Note, I deleted the attachment file here and moved the workaround to DRILL-4085. Keeping this open until actual concurrency bug is resolved. > Sporadic IllegalReferenceCountException in RPC layer causes query to hang > ------------------------------------------------------------------------- > > Key: DRILL-4057 > URL: https://issues.apache.org/jira/browse/DRILL-4057 > Project: Apache Drill > Issue Type: Bug > Reporter: Sudheesh Katkam > Priority: Critical > > In a 10 node cluster, running TPCH queries one at a time, sporadically there > is a "IllegalReferenceCountException" in the logs causing a query to hang. > In one of the logs (node 8): > {code} > 2015-11-03 15:11:01,611 ucs-node8.perf.lab [DATA-rpc-event-queue] ERROR > o.a.drill.exec.rpc.data.DataServer - Failure while handling message. > io.netty.util.IllegalReferenceCountException: refCnt: 0 > at > io.netty.buffer.AbstractByteBuf.ensureAccessible(AbstractByteBuf.java:1178) > ~[netty-buffer-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.buffer.AbstractByteBuf.checkReadableBytes(AbstractByteBuf.java:1161) > ~[netty-buffer-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.buffer.AbstractByteBuf.readBytes(AbstractByteBuf.java:676) > ~[netty-buffer-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.buffer.ByteBufInputStream.read(ByteBufInputStream.java:120) > ~[netty-buffer-4.0.27.Final.jar:4.0.27.Final] > at java.io.InputStream.read(InputStream.java:101) ~[na:1.7.0_65] > at > com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737) > ~[protobuf-java-2.5.0.jar:na] > at > com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701) > ~[protobuf-java-2.5.0.jar:na] > at > com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99) > ~[protobuf-java-2.5.0.jar:na] > at > org.apache.drill.exec.proto.BitData$FragmentRecordBatch.<init>(BitData.java:1152) > ~[drill-protocol-1.3.0-SNAPSHOT.jar:1.3.0-SNAPSHOT] > at > org.apache.drill.exec.proto.BitData$FragmentRecordBatch.<init>(BitData.java:1116) > ~[drill-protocol-1.3.0-SNAPSHOT.jar:1.3.0-SNAPSHOT] > at > org.apache.drill.exec.proto.BitData$FragmentRecordBatch$1.parsePartialFrom(BitData.java:1269) > ~[drill-protocol-1.3.0-SNAPSHOT.jar:1.3.0-SNAPSHOT] > at > org.apache.drill.exec.proto.BitData$FragmentRecordBatch$1.parsePartialFrom(BitData.java:1264) > ~[drill-protocol-1.3.0-SNAPSHOT.jar:1.3.0-SNAPSHOT] > at > com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200) > ~[protobuf-java-2.5.0.jar:na] > at > com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217) > ~[protobuf-java-2.5.0.jar:na] > at > com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223) > ~[protobuf-java-2.5.0.jar:na] > at > com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49) > ~[protobuf-java-2.5.0.jar:na] > at org.apache.drill.exec.rpc.RpcBus.get(RpcBus.java:332) > ~[drill-java-exec-1.3.0-SNAPSHOT.jar:1.3.0-SNAPSHOT] > at > org.apache.drill.exec.rpc.data.DataServer.handle(DataServer.java:124) > ~[drill-java-exec-1.3.0-SNAPSHOT.jar:1.3.0-SNAPSHOT] > at > org.apache.drill.exec.rpc.data.DataServer.handle(DataServer.java:51) > ~[drill-java-exec-1.3.0-SNAPSHOT.jar:1.3.0-SNAPSHOT] > at org.apache.drill.exec.rpc.RpcBus$RequestEvent.run(RpcBus.java:390) > ~[drill-java-exec-1.3.0-SNAPSHOT.jar:1.3.0-SNAPSHOT] > at > org.apache.drill.common.SerializedExecutor$RunnableProcessor.run(SerializedExecutor.java:105) > [drill-common-1.3.0-SNAPSHOT.jar:1.3.0-SNAPSHOT] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > [na:1.7.0_65] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > [na:1.7.0_65] > at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65] > ... > 2015-11-03 15:11:01,612 ucs-node8.perf.lab [DATA-rpc-event-queue] ERROR > o.a.drill.exec.rpc.data.DataServer - Failure while running rpc command. > java.lang.IllegalArgumentException: Buffer has negative reference count. > at > com.google.common.base.Preconditions.checkArgument(Preconditions.java:92) > ~[guava-14.0.1.jar:na] > at io.netty.buffer.DrillBuf.release(DrillBuf.java:250) > ~[drill-java-exec-1.3.0-SNAPSHOT.jar:4.0.27.Final] > at io.netty.buffer.DrillBuf.release(DrillBuf.java:259) > ~[drill-java-exec-1.3.0-SNAPSHOT.jar:4.0.27.Final] > at io.netty.buffer.DrillBuf.release(DrillBuf.java:239) > ~[drill-java-exec-1.3.0-SNAPSHOT.jar:4.0.27.Final] > at org.apache.drill.exec.rpc.RpcBus$RequestEvent.run(RpcBus.java:397) > ~[drill-java-exec-1.3.0-SNAPSHOT.jar:1.3.0-SNAPSHOT] > at > org.apache.drill.common.SerializedExecutor$RunnableProcessor.run(SerializedExecutor.java:105) > ~[drill-common-1.3.0-SNAPSHOT.jar:1.3.0-SNAPSHOT] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > [na:1.7.0_65] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > [na:1.7.0_65] > at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65] > {code} > On one of the other nodes (node 9), there is a thread that is waiting: > {code} > "29c6c10e-642f-c605-cd73-f466ddd986d4:frag:1:216" daemon prio=10 > tid=0x00007fc2d151f800 nid=0x5edf waiting on condition [0x00007fc273f36000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0000000715db0ba8> (a > java.util.concurrent.Semaphore$NonfairSync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303) > at java.util.concurrent.Semaphore.acquire(Semaphore.java:472) > at > org.apache.drill.exec.ops.SendingAccountor.waitForSendComplete(SendingAccountor.java:48) > - locked <0x0000000715db0b70> (a > org.apache.drill.exec.ops.SendingAccountor) > at > org.apache.drill.exec.ops.FragmentContext.waitForSendComplete(FragmentContext.java:441) > at > org.apache.drill.exec.physical.impl.BaseRootExec.close(BaseRootExec.java:133) > at > org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources(FragmentExecutor.java:339) > at > org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:179) > at > org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:290) > at > org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)