[ https://issues.apache.org/jira/browse/DRILL-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203770#comment-15203770 ]
Deneche A. Hakim commented on DRILL-3714:
-----------------------------------------
Still investigating this issue. Here is what I've found so far:
- Drill runs out of direct memory in the RPC layer on the data server side
- this causes the data channel to be closed
- the data client detects that the connection was closed and fails the
fragment that was sending the batch
- because connections are pooled, all fragments that were sending data also
fail
- all failed fragments try to close their resources before reporting their
FAILED state to the Foreman
- their sending accountors report that some data batches are still waiting to
be sent, so those fragments block waiting for those batches to either succeed
or fail
- for some reason, those remaining batches never report back; the fragments
stay blocked forever and the Foreman is never informed, which causes the
query to hang (see the sketch after this list)
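To make the blocking step concrete, here is a minimal sketch of the accountor pattern described above; the class and method names are hypothetical, not Drill's actual API. Each outgoing batch takes a count, each completion (success or failure) releases a permit, and fragment cleanup blocks until every permit has arrived, so a dead channel that never fires its completion listeners leaves the final acquire waiting forever:
{code}
// Hypothetical sketch of the "sending accountor" pattern described above;
// names are illustrative, not Drill's actual API.
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class SendingAccountorSketch {
    private final AtomicInteger batchesSent = new AtomicInteger(0);
    private final Semaphore wait = new Semaphore(0);

    // Called once for every data batch handed to the RPC layer.
    void increment() {
        batchesSent.incrementAndGet();
    }

    // Called when the RPC layer reports that a batch succeeded OR failed.
    void decrement() {
        wait.release();
    }

    // Called during fragment cleanup: blocks until every outstanding batch
    // has reported back. If the channel dies without invoking the per-batch
    // completion listeners, decrement() is never called, this call never
    // returns, and the fragment can never report FAILED to the Foreman.
    void waitForSendComplete() throws InterruptedException {
        int outstanding = batchesSent.getAndSet(0);
        wait.acquire(outstanding);
    }
}
{code}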
Note that sometimes (roughly 1 in 10 queries) the Foreman does receive a
FAILED state and the query does fail, but even in that case I didn't check
whether some fragments are left blocked.
> Query runs out of memory and remains in CANCELLATION_REQUESTED state until
> drillbit is restarted
> ------------------------------------------------------------------------------------------------
>
> Key: DRILL-3714
> URL: https://issues.apache.org/jira/browse/DRILL-3714
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow
> Affects Versions: 1.2.0
> Reporter: Victoria Markman
> Assignee: Deneche A. Hakim
> Priority: Critical
> Fix For: 1.7.0
>
> Attachments: Screen Shot 2015-08-26 at 10.36.33 AM.png, drillbit.log,
> jstack.txt, query_profile_2a2210a7-7a78-c774-d54c-c863d0b77bb0.json
>
>
> This is a variation of DRILL-3705; the difference is Drill's behavior when
> hitting the OOM condition.
> The query runs out of memory during execution and remains in the
> "CANCELLATION_REQUESTED" state until the drillbit is bounced.
> The client (sqlline in this case) never gets a response from the server.
> Reproduction details:
> Single node drillbit installation.
> DRILL_MAX_DIRECT_MEMORY="8G"
> DRILL_HEAP="4G"
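> These limits are typically set in Drill's conf/drill-env.sh; a minimal
> sketch, assuming a default install layout:
> {code}
> # conf/drill-env.sh (typical location; adjust for your install)
> export DRILL_MAX_DIRECT_MEMORY="8G"
> export DRILL_HEAP="4G"
> {code}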
> Run this query on the TPC-DS SF100 data set:
> {code}
> SELECT SUM(ss.ss_net_paid_inc_tax) OVER (PARTITION BY ss.ss_store_sk) AS
> TotalSpend FROM store_sales ss WHERE ss.ss_store_sk IS NOT NULL ORDER BY 1
> LIMIT 10;
> {code}
> drillbit.log
> {code}
> 2015-08-26 16:54:58,469 [2a2210a7-7a78-c774-d54c-c863d0b77bb0:frag:3:22] INFO o.a.d.e.w.f.FragmentStatusReporter - 2a2210a7-7a78-c774-d54c-c863d0b77bb0:3:22: State to report: RUNNING
> 2015-08-26 16:55:50,498 [BitServer-5] WARN o.a.drill.exec.rpc.data.DataServer - Message of mode REQUEST of rpc type 3 took longer than 500ms. Actual duration was 2569ms.
> 2015-08-26 16:56:31,086 [BitServer-5] ERROR o.a.d.exec.rpc.RpcExceptionHandler - Exception in RPC communication. Connection: /10.10.88.133:31012 <--> /10.10.88.133:54554 (data server). Closing connection.
> io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Direct buffer memory
>     at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:233) ~[netty-codec-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:618) [netty-transport-native-epoll-4.0.27.Final-linux-x86_64.jar:na]
>     at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:329) [netty-transport-native-epoll-4.0.27.Final-linux-x86_64.jar:na]
>     at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:250) [netty-transport-native-epoll-4.0.27.Final-linux-x86_64.jar:na]
>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) [netty-common-4.0.27.Final.jar:4.0.27.Final]
>     at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
> Caused by: java.lang.OutOfMemoryError: Direct buffer memory
>     at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_71]
>     at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.7.0_71]
>     at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) ~[na:1.7.0_71]
>     at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:437) ~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) ~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) ~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.buffer.PoolArena.reallocate(PoolArena.java:280) ~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:110) ~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251) ~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849) ~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841) ~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831) ~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.buffer.WrappedByteBuf.writeBytes(WrappedByteBuf.java:600) ~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.buffer.UnsafeDirectLittleEndian.writeBytes(UnsafeDirectLittleEndian.java:28) ~[drill-java-exec-1.2.0-SNAPSHOT.jar:4.0.27.Final]
>     at io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:92) ~[netty-codec-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:227) ~[netty-codec-4.0.27.Final.jar:4.0.27.Final]
>     ... 11 common frames omitted
> 2015-08-26 16:56:31,087 [BitServer-5] INFO o.a.d.exec.rpc.ProtobufLengthDecoder - Channel is closed, discarding remaining 124958 byte(s) in buffer.
> 2015-08-26 16:56:31,087 [BitClient-1] ERROR o.a.d.exec.rpc.RpcExceptionHandler - Exception in RPC communication. Connection: /10.10.88.133:54554 <--> /10.10.88.133:31012 (data client). Closing connection.
> java.io.IOException: syscall:read(...)() failed: Connection reset by peer
> 2015-08-26 16:56:31,088 [BitClient-1] INFO o.a.drill.exec.rpc.data.DataClient - Channel closed /10.10.88.133:54554 <--> /10.10.88.133:31012.
> 2015-08-26 16:56:35,325 [BitServer-6] ERROR o.a.d.exec.rpc.RpcExceptionHandler - Exception in RPC communication. Connection: /10.10.88.133:31012 <--> /10.10.88.133:54555 (data server). Closing connection.
> io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Direct buffer memory
>     at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:233) ~[netty-codec-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>     at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:618) [netty-transport-native-epoll-4.0.27.Final-linux-x86_64.jar:na]
>     at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:329) [netty-transport-native-epoll-4.0.27.Final-linux-x86_64.jar:na]
>     at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:250) [netty-transport-native-epoll-4.0.27.Final-linux-x86_64.jar:na]
>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) [netty-common-4.0.27.Final.jar:4.0.27.Final]
>     at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
> {code}
> Attached:
> query_profile_2a2210a7-7a78-c774-d54c-c863d0b77bb0.json
> drillbit.log
> jstack.txt