[https://issues.apache.org/jira/browse/DRILL-5513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054729#comment-16054729]
Paul Rogers commented on DRILL-5513:
------------------------------------
See [this post|https://github.com/paul-rogers/drill/wiki/Drill-Spill-File-Format] for an explanation of the memory issue with the spill file format.
The existing sort code in {{PriorityQueueCopierTemplate}} loads a fixed number of record batches, which requires that the code be able to predict the memory those batches will consume. The prior discussion shows that, in general, the code cannot correctly predict the memory size, because the vector storage format changes when batches are spilled and read back.
The solution, then, is to modify {{PriorityQueueCopierTemplate}} to load up to a given number of batches, *or* up to a given memory limit, whichever is reached first.
Even so, in the worst case a failure will still occur if memory does not allow at least two runs to be loaded, since a merge needs at least two runs in memory at once.
> Managed External Sort : OOM error during the merge phase
> --------------------------------------------------------
>
> Key: DRILL-5513
> URL: https://issues.apache.org/jira/browse/DRILL-5513
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.10.0
> Reporter: Rahul Challapalli
> Assignee: Paul Rogers
> Attachments: 26e5f7b8-71e8-afca-e72e-fad7be2b2416.sys.drill,
> drillbit.log
>
>
> git.commit.id.abbrev=1e0a14c
> No of nodes in cluster : 1
> DRILL_MAX_DIRECT_MEMORY="32G"
> DRILL_MAX_HEAP="4G"
> The below query fails with an OOM
> {code}
> ALTER SESSION SET `exec.sort.disable_managed` = false;
> alter session set `planner.width.max_per_query` = 100;
> alter session set `planner.memory.max_query_memory_per_node` = 652428800;
> select count(*) from (select s1.type type, flatten(s1.rms.rptd) rptds from
> (select d.type type, d.uid uid, flatten(d.map.rm) rms from
> dfs.`/drill/testdata/resource-manager/nested-large.json` d order by d.uid) s1
> order by s1.rms.mapid);
> {code}
> Exception from the logs
> {code}
> 2017-05-15 12:58:46,646 [BitServer-4] DEBUG
> o.a.drill.exec.work.foreman.Foreman - 26e5f7b8-71e8-afca-e72e-fad7be2b2416:
> State change requested RUNNING --> FAILED
> org.apache.drill.common.exceptions.UserRemoteException: RESOURCE ERROR: One
> or more nodes ran out of memory while executing the query.
> Unable to allocate buffer of size 2097152 due to memory limit. Current
> allocation: 19791880
> Fragment 5:2
> [Error Id: bb67176f-a780-400d-88c9-06fea131ea64 on qa-node190.qa.lab:31010]
> (org.apache.drill.exec.exception.OutOfMemoryException) Unable to allocate
> buffer of size 2097152 due to memory limit. Current allocation: 19791880
> org.apache.drill.exec.memory.BaseAllocator.buffer():220
> org.apache.drill.exec.memory.BaseAllocator.buffer():195
> org.apache.drill.exec.vector.BigIntVector.reAlloc():212
> org.apache.drill.exec.vector.BigIntVector.copyFromSafe():324
> org.apache.drill.exec.vector.NullableBigIntVector.copyFromSafe():367
>
> org.apache.drill.exec.vector.NullableBigIntVector$TransferImpl.copyValueSafe():328
>
> org.apache.drill.exec.vector.complex.RepeatedMapVector$RepeatedMapTransferPair.copyValueSafe():360
>
> org.apache.drill.exec.vector.complex.MapVector$MapTransferPair.copyValueSafe():220
> org.apache.drill.exec.vector.complex.MapVector.copyFromSafe():82
>
> org.apache.drill.exec.test.generated.PriorityQueueCopierGen4494.doCopy():34
> org.apache.drill.exec.test.generated.PriorityQueueCopierGen4494.next():76
>
> org.apache.drill.exec.physical.impl.xsort.managed.CopierHolder$BatchMerger.next():234
>
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.mergeSpilledRuns():1214
>
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load():689
>
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext():559
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>
> org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next():215
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>
> org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>
> org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next():215
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():234
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():227
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():415
> org.apache.hadoop.security.UserGroupInformation.doAs():1595
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():227
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1145
> java.util.concurrent.ThreadPoolExecutor$Worker.run():615
> java.lang.Thread.run():745
> at
> org.apache.drill.exec.work.foreman.QueryManager$1.statusUpdate(QueryManager.java:537)
> [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
> at
> org.apache.drill.exec.rpc.control.WorkEventBus.statusUpdate(WorkEventBus.java:71)
> [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
> at
> org.apache.drill.exec.work.batch.ControlMessageHandler.handle(ControlMessageHandler.java:94)
> [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
> at
> org.apache.drill.exec.work.batch.ControlMessageHandler.handle(ControlMessageHandler.java:55)
> [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
> at org.apache.drill.exec.rpc.BasicServer.handle(BasicServer.java:159)
> [drill-rpc-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
> at org.apache.drill.exec.rpc.BasicServer.handle(BasicServer.java:53)
> [drill-rpc-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
> at
> org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:274)
> [drill-rpc-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
> at
> org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:244)
> [drill-rpc-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
> at
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
> [netty-codec-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.handler.timeout.ReadTimeoutHandler.channelRead(ReadTimeoutHandler.java:150)
> [netty-handler-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> [netty-codec-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)
> [netty-codec-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> [netty-common-4.0.27.Final.jar:4.0.27.Final]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
> {code}
> Attached the log and profile files. The dataset is not attached here as it is
> larger than the permitted size.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)