[
https://issues.apache.org/jira/browse/DRILL-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15995875#comment-15995875
]
Paul Rogers commented on DRILL-5470:
------------------------------------
E-mail thread from the dev list:
Nate's original post:
{quote}
We keep running into this issue when trying to issue a query with hashagg
disabled. When I look at system memory usage, though, Drill doesn't seem to
be using much of it but still hits this error.
Our environment:
- 1 r3.8xl
- 1 drillbit version 1.10.0 configured with 4GB of Heap and 230G of Direct
- Data stored on S3 is compressed CSV
I've tried increasing planner.memory.max_query_memory_per_node to 230G and
lowering planner.width.max_per_query to 1, and it still fails.
We've applied the patch from DRILL-5226 in the hopes that it would resolve
the issue, but it hasn't.
Stack Trace:
{code}
(org.apache.drill.exec.exception.OutOfMemoryException) Unable to allocate
buffer of size 16777216 due to memory limit. Current allocation: 8445952
org.apache.drill.exec.memory.BaseAllocator.buffer():220
org.apache.drill.exec.memory.BaseAllocator.buffer():195
org.apache.drill.exec.vector.VarCharVector.reAlloc():425
org.apache.drill.exec.vector.VarCharVector.copyFromSafe():278
org.apache.drill.exec.vector.NullableVarCharVector.copyFromSafe():379
org.apache.drill.exec.test.generated.PriorityQueueCopierGen328.doCopy():22
org.apache.drill.exec.test.generated.PriorityQueueCopierGen328.next():75
org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.mergeAndSpill():602
org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.innerNext():428
org.apache.drill.exec.record.AbstractRecordBatch.next():162
...
{code}
Is there something I'm missing here? Any help/direction would be
appreciated.
{quote}
Zelaine's response:
{quote}
The Jira you’ve referenced relates to the new external sort, which is not
enabled by default, as it is still going through some additional testing. If
you’d like to try it to see if it resolves your problem, you’ll need to
set “sort.external.disable_managed” as follows in your drill-override.conf
file:
{code}
drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "localhost:2181",
  sort.external.disable_managed: false
}
{code}
and run the following statement:
{code}
ALTER SESSION SET `exec.sort.disable_managed` = false;
{code}
{quote}
From Nate again:
{quote}
Zelaine, thanks for the suggestion. I added this option both to
drill-override.conf and in the session, and this time the query stayed running
much longer, but it still eventually failed with the same error, although with
much different memory values.
{code}
(org.apache.drill.exec.exception.OutOfMemoryException) Unable to allocate
buffer of size 134217728 due to memory limit. Current allocation:
10653214316
org.apache.drill.exec.memory.BaseAllocator.buffer():220
org.apache.drill.exec.memory.BaseAllocator.buffer():195
org.apache.drill.exec.vector.VarCharVector.reAlloc():425
org.apache.drill.exec.vector.VarCharVector.copyFromSafe():278
org.apache.drill.exec.vector.NullableVarCharVector.copyFromSafe():379
org.apache.drill.exec.test.generated.PriorityQueueCopierGen8.doCopy():22
org.apache.drill.exec.test.generated.PriorityQueueCopierGen8.next():76
org.apache.drill.exec.physical.impl.xsort.managed.CopierHolder$BatchMerger.next():234
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.doMergeAndSpill():1408
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.mergeAndSpill():1376
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.spillFromMemory():1339
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.processBatch():831
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.loadBatch():618
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load():660
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext():559
org.apache.drill.exec.record.AbstractRecordBatch.next():162
...
{code}
At first I didn't change planner.width.max_per_query, and the default on a
32-core machine makes it 23. This query failed after 34 minutes. I then tried
setting planner.width.max_per_query=1; this query also failed but of course
took longer, about 2 hours. In both cases,
planner.memory.max_query_memory_per_node was set to 230G.
{quote}
My response, which the Apache e-mail daemon blocked:
{quote}
I’ll give you three separate suggestions. The first two build on the discussion
with Zelaine. The third gets at a separate problem that could be the root cause.
First, let’s discuss logging. When we hit a bug such as this, the logs are
incredibly useful to learn what is going on. Turn on debug logging. If you are
familiar with Java logging, then you only need to enable the debug level for
the org.apache.drill.exec.physical.impl.xsort.managed package. Then, look for
lines that say “ExternalSortBatch”.
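For reference, here is a minimal sketch of that Logback change, assuming the
stock conf/logback.xml that ships with Drill (the FILE appender name comes
from that stock file; adjust if yours differs):
{code}
<!-- Add inside <configuration> in conf/logback.xml.          -->
<!-- Enables DEBUG output for the managed external sort only. -->
<logger name="org.apache.drill.exec.physical.impl.xsort.managed" additivity="false">
  <level value="debug"/>
  <appender-ref ref="FILE"/>
</logger>
{code}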
You will see a number of entries early on that identify the amount of memory
available to the sort, the size of the incoming batches, and how we will slice
up memory. Please post those lines to your JIRA entry.
Then, later, you’ll see an entry for the OOM error. Review the preceding
entries to get a sense of where the sort was: was it still reading and spilling
data from upstream (the sort phase)? Or had it gotten to the merge phase, in
which we reread spilled data?
The log entries, while cryptic at first glance, make a bit more sense after you
scan through the full set. Post those lines with summary info.
Also, the query profile will tell you how much memory was actually used at the
time of the OOM. You can compare that with the “budget” explained in the log
file entry mentioned above.
Second, we can better define how Drill works with sort memory to help you
properly configure your setup.
Here is some background.
* Your system has some amount of memory. In your case, 230 GB.
* To allocate memory to the sort, Drill does not use actual system memory.
Instead, we use planner.memory.max_query_memory_per_node. (The idea is that you
set this value to, roughly, system memory / number of concurrent queries.)
* Drill divides up memory to compute per-sort memory as: query memory per node
/ no. of slices / no. of sorts in the query. (See the worked numbers below.)
* In your system, the number of slices is 23, so each fragment gets 10 GB of
memory.
* If your query has a single sort, then each sort gets 10 GB of memory.
* However, memory per query is capped by the boot-time drill.memory.top.max
option (see below), which defaults to 20 GB. Not an issue here, but it would be
if the numbers above came out differently.
* Changing planner.width.max_per_query, as you did, has no effect on memory.
* You’d ideally change planner.width.max_per_node to 1 to run the query
single-threaded. But, due to the item above, no sort will get more than 20 GB
anyway.
See the [actual
code|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/util/MemoryAllocationUtilities.java]
for details.
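Putting numbers to that formula for your setup (a sketch; it assumes your
query plan contains a single sort):
{code}
memory per sort = max_query_memory_per_node / no. of slices / no. of sorts
                = 230 GB / 23 / 1
                = 10 GB per sort (10,737,418,240 bytes)
{code}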
Despite all this, the likely original 10 GB allocation should be plenty; the
sort is supposed to spill. How much it spills depends on your input data size.
When sorting, performance is affected by memory:
* If your data is smaller than sort memory, sorting happens in memory, and
performance is optimal.
* If your data is larger than memory, but smaller than 8x memory, you’ll get a
“single generation” spill/merge and performance should be no worse than 3x an
in-memory sort. (1x is the original data read, then another 1x for spill and
the third 1x for read/merge.)
* If your data is larger than 8x memory, sorting will need multiple generations
of spill/merge/re-spill, and run-time will increase accordingly.
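To make those thresholds concrete with the ~10 GB per-sort budget computed
above:
{code}
in memory:          input <= 10 GB               -- no spill
single generation:  input <= 8 x 10 GB = 80 GB   -- spill once, roughly 3x cost
multi-generation:   input >  80 GB               -- repeated spill/merge/re-spill
{code}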
Some options:
* Set planner.width.max_per_node to 1 to run the query single-threaded. This
will use all memory for the single sort.
* But, we’ve got that pesky 20 GB global cap. So, change your
drill-override.conf file as follows:
{code}
drill.memory.top.max: 100000000000
{code}
(Sorry for all the zeros. It is supposed to be 100 GB. We really should switch
to a better format to specify memory…) 100 GB seems plenty without going larger.
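For the width change, the option can be set per session, just as in Zelaine’s
example above:
{code}
ALTER SESSION SET `planner.width.max_per_node` = 1;
{code}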
You can verify that these changes take effect by looking for the log line that
explains the managed sort’s memory calculations (when debug logging is enabled).
Third, all that said, I wonder if the problem is elsewhere. Yes, you are
getting an Out of Memory (OOM) error, but not in the usual place that
indicates a sort issue. Instead, you are getting it in the allocation of a
“value vector.” This raises some questions:
* How big is your input data (size on disk)?
* How many columns?
* How wide are your VarChar columns, on average?
You mentioned data is compressed CSV. With typical 8x compression, actual data
sorted will be ~8x your on-disk size.
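For instance, using a hypothetical on-disk size (the real figure is one of the
questions above):
{code}
30 GB compressed CSV x 8 = ~240 GB to sort
240 GB >> the 10 GB sort budget, and > the 80 GB single-generation limit
{code}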
The column width question is critical. I see that the vector is trying to
allocate 16 MB of data, which is unusual.
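As a rough check on that 16 MB figure, assuming the typical 64K-value batch
(actual batch row counts depend on the upstream operator):
{code}
16,777,216 bytes / 65,536 values per batch = ~256 bytes per value
{code}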
{quote}
> External Sort - Unable to Allocate Buffer error
> -----------------------------------------------
>
> Key: DRILL-5470
> URL: https://issues.apache.org/jira/browse/DRILL-5470
> Project: Apache Drill
> Issue Type: Bug
> Components: Server
> Affects Versions: 1.10.0
> Environment: - ubuntu 14.04
> - r3.8xl (32 CPU/240GB Mem)
> - openjdk version "1.8.0_111"
> - drill 1.10.0 with 8656c83b00f8ab09fb6817e4e9943b2211772541 cherry-picked
> Reporter: Nathan Butler
> Assignee: Paul Rogers
>
> Per the mailing list discussion and Rahul's and Paul's suggestion I'm filing
> this Jira issue. Drill seems to be running out of memory when doing an
> External Sort. Per Zelaine's suggestion I enabled
> sort.external.disable_managed in drill-override.conf and in the sqlline
> session. This caused the query to run for longer but it still would fail with
> the same message.
> Per Paul's suggestion, I enabled debug logging for the
> org.apache.drill.exec.physical.impl.xsort.managed package and re-ran the
> query.
> Here's the initial DEBUG line for ExternalSortBatch for our query:
> bq. 2017-05-03 12:02:56,095 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:15] DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Config: memory limit = 10737418240, spill file size = 268435456, spill batch size = 8388608, merge limit = 2147483647, merge batch size = 16777216
> And here's the last DEBUG line before the stack trace:
> bq. 2017-05-03 12:37:44,249 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:4] DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Available memory: 10737418240, buffer memory = 10719535268, merge memory = 10707140978
> And the stack trace:
> {code}
> 2017-05-03 12:38:02,927 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:6] INFO o.a.d.e.p.i.x.m.ExternalSortBatch - User Error Occurred: External Sort encountered an error while spilling to disk (Unable to allocate buffer of size 268435456 due to memory limit. Current allocation: 10579849472)
> org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: External Sort encountered an error while spilling to disk
> [Error Id: 5d53c677-0cd9-4c01-a664-c02089670a1c ]
> at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:544) ~[drill-common-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.doMergeAndSpill(ExternalSortBatch.java:1447) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.mergeAndSpill(ExternalSortBatch.java:1376) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.spillFromMemory(ExternalSortBatch.java:1339) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.processBatch(ExternalSortBatch.java:831) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.loadBatch(ExternalSortBatch.java:618) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load(ExternalSortBatch.java:660) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext(ExternalSortBatch.java:559) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext(StreamingAggBatch.java:137) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:104) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext(PartitionSenderRootExec.java:144) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:94) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:232) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:226) [drill-java-exec-1.10.0.jar:1.10.0]
> at java.security.AccessController.doPrivileged(Native Method) [na:1.8.0_111]
> at javax.security.auth.Subject.doAs(Subject.java:422) [na:1.8.0_111]
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) [hadoop-common-2.7.1.jar:na]
> at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:226) [drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.10.0.jar:1.10.0]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_111]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_111]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_111]
> Caused by: org.apache.drill.exec.exception.OutOfMemoryException: Unable to allocate buffer of size 268435456 due to memory limit. Current allocation: 10579849472
> at org.apache.drill.exec.memory.BaseAllocator.buffer(BaseAllocator.java:220) ~[drill-memory-base-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.memory.BaseAllocator.buffer(BaseAllocator.java:195) ~[drill-memory-base-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.vector.VarCharVector.reAlloc(VarCharVector.java:425) ~[vector-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.vector.VarCharVector.copyFromSafe(VarCharVector.java:278) ~[vector-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.vector.NullableVarCharVector.copyFromSafe(NullableVarCharVector.java:379) ~[vector-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.test.generated.PriorityQueueCopierGen140.doCopy(PriorityQueueCopierTemplate.java:22) ~[na:na]
> at org.apache.drill.exec.test.generated.PriorityQueueCopierGen140.next(PriorityQueueCopierTemplate.java:76) ~[na:na]
> at org.apache.drill.exec.physical.impl.xsort.managed.CopierHolder$BatchMerger.next(CopierHolder.java:234) ~[drill-java-exec-1.10.0.jar:1.10.0]
> at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.doMergeAndSpill(ExternalSortBatch.java:1408) [drill-java-exec-1.10.0.jar:1.10.0]
> ... 24 common frames omitted
> {code}
> I'm in communication with Paul and will send him the full log file.
> Thanks,
> Nathan