[ 
https://issues.apache.org/jira/browse/DRILL-5513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053554#comment-16053554
 ] 

Paul Rogers commented on DRILL-5513:
------------------------------------

The problem now is that the amount of memory needed to read a spilled batch is 
significantly larger than the amount of memory the batch consumed when written. 
From the (revised) logs:

{code}
ExternalSortBatch - Merge batch size: net = 6666666 bytes,
   gross = 8888888 bytes, records = 35460;
   spill file size: 268435456 bytes
...
PriorityQueueCopierWrapper - Merged 35460 records,
   consuming 7536640 bytes of memory
...
BatchGroup - Read 35460 records; size = 9437184
{code}

The sort decided a spill batch should contain 6,666,666 bytes of data. It 
assumed vectors are 75% full (25% internal fragmentation) for a memory target 
of 8,888,888 allocated bytes.

Actual spill was close: actual allocated memory was 7,536,640 bytes.

But, then the assumptions fell apart on re-reading this same data. The read 
needed 9,437,184 bytes of memory. Since this is much more than was expected, we 
got whacked by the operator memory allocator killing us due to an excessive 
allocation. (Sigh... Is this really a constructive use of time?)

Furthermore, since we have only 20 MB of memory, rereading two spilled batches 
consumes 18,874,368 bytes, leaving only 1,125,632 to hold an output batch that 
is supposed to be 6,666,668 bytes in size (net) or 8,888,888 bytes gross.

Have to come up with some way to anticipate the unexpected excessive memory use 
and reduce the spill batch size accordingly. (Or, have to track down the reason 
for the excessive use and fix it...)

> Managed External Sort : OOM error during the merge phase
> --------------------------------------------------------
>
>                 Key: DRILL-5513
>                 URL: https://issues.apache.org/jira/browse/DRILL-5513
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>            Assignee: Paul Rogers
>         Attachments: 26e5f7b8-71e8-afca-e72e-fad7be2b2416.sys.drill, 
> drillbit.log
>
>
> git.commit.id.abbrev=1e0a14c
> No of nodes in cluster : 1
> DRILL_MAX_DIRECT_MEMORY="32G"
> DRILL_MAX_HEAP="4G"
> The below query fails with an OOM
> {code}
> ALTER SESSION SET `exec.sort.disable_managed` = false;
> alter session set `planner.width.max_per_query` = 100;
> alter session set `planner.memory.max_query_memory_per_node` = 652428800;
> select count(*) from (select s1.type type, flatten(s1.rms.rptd) rptds from 
> (select d.type type, d.uid uid, flatten(d.map.rm) rms from 
> dfs.`/drill/testdata/resource-manager/nested-large.json` d order by d.uid) s1 
> order by s1.rms.mapid);
> {code}
> Exception from the logs
> {code}
> 2017-05-15 12:58:46,646 [BitServer-4] DEBUG 
> o.a.drill.exec.work.foreman.Foreman - 26e5f7b8-71e8-afca-e72e-fad7be2b2416: 
> State change requested RUNNING --> FAILED
> org.apache.drill.common.exceptions.UserRemoteException: RESOURCE ERROR: One 
> or more nodes ran out of memory while executing the query.
> Unable to allocate buffer of size 2097152 due to memory limit. Current 
> allocation: 19791880
> Fragment 5:2
> [Error Id: bb67176f-a780-400d-88c9-06fea131ea64 on qa-node190.qa.lab:31010]
>   (org.apache.drill.exec.exception.OutOfMemoryException) Unable to allocate 
> buffer of size 2097152 due to memory limit. Current allocation: 19791880
>     org.apache.drill.exec.memory.BaseAllocator.buffer():220
>     org.apache.drill.exec.memory.BaseAllocator.buffer():195
>     org.apache.drill.exec.vector.BigIntVector.reAlloc():212
>     org.apache.drill.exec.vector.BigIntVector.copyFromSafe():324
>     org.apache.drill.exec.vector.NullableBigIntVector.copyFromSafe():367
>     
> org.apache.drill.exec.vector.NullableBigIntVector$TransferImpl.copyValueSafe():328
>     
> org.apache.drill.exec.vector.complex.RepeatedMapVector$RepeatedMapTransferPair.copyValueSafe():360
>     
> org.apache.drill.exec.vector.complex.MapVector$MapTransferPair.copyValueSafe():220
>     org.apache.drill.exec.vector.complex.MapVector.copyFromSafe():82
>     
> org.apache.drill.exec.test.generated.PriorityQueueCopierGen4494.doCopy():34
>     org.apache.drill.exec.test.generated.PriorityQueueCopierGen4494.next():76
>     
> org.apache.drill.exec.physical.impl.xsort.managed.CopierHolder$BatchMerger.next():234
>     
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.mergeSpilledRuns():1214
>     
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load():689
>     
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext():559
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     
> org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next():215
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>     
> org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     
> org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next():215
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>     
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():234
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():227
>     java.security.AccessController.doPrivileged():-2
>     javax.security.auth.Subject.doAs():415
>     org.apache.hadoop.security.UserGroupInformation.doAs():1595
>     org.apache.drill.exec.work.fragment.FragmentExecutor.run():227
>     org.apache.drill.common.SelfCleaningRunnable.run():38
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1145
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():615
>     java.lang.Thread.run():745
>         at 
> org.apache.drill.exec.work.foreman.QueryManager$1.statusUpdate(QueryManager.java:537)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.rpc.control.WorkEventBus.statusUpdate(WorkEventBus.java:71)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.work.batch.ControlMessageHandler.handle(ControlMessageHandler.java:94)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.work.batch.ControlMessageHandler.handle(ControlMessageHandler.java:55)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at org.apache.drill.exec.rpc.BasicServer.handle(BasicServer.java:159) 
> [drill-rpc-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at org.apache.drill.exec.rpc.BasicServer.handle(BasicServer.java:53) 
> [drill-rpc-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:274) 
> [drill-rpc-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:244) 
> [drill-rpc-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
>  [netty-codec-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.handler.timeout.ReadTimeoutHandler.channelRead(ReadTimeoutHandler.java:150)
>  [netty-handler-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>  [netty-codec-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)
>  [netty-codec-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>  [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>  [netty-common-4.0.27.Final.jar:4.0.27.Final]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
> {code}
> Attached the log and profile files. The dataset is not attached here as it is 
> larger than the permitted size



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to