[
https://issues.apache.org/jira/browse/DRILL-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879664#comment-15879664
]
Paul Rogers edited comment on DRILL-5294 at 2/23/17 1:52 AM:
-------------------------------------------------------------
Basic statistics:
{code}
ExternalSortBatch - Config: memory limit = 126322567, spill file size =
268435456, batch size = 8388608,
merge limit = 2147483647, merge batch size = 12632257
{code}
The above line appears 17 times in the log, showing that the query has 17
slices (AKA minor fragments). This can also be seen from the minor fragment
number in the log line:
{code}
[2751ce6d-67e6-ae08-3b68-e33b29f9d2a3:frag:1:16] ... ExternalSortBatch - Config
{code}
Memory calcs:
{code}
Input Batch Estimates: record size = 255 bytes; input batch = 365313 bytes,
1023 records
Merge batch size = 8388608 bytes, 32896 records; spill file size: 268435456
bytes
Output batch size = 12632257 bytes, 49538 records
Available memory: 126322567, buffer memory = 109180038, merge memory = 84004506
{code}
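As a side note, the configured merge batch size appears to be derived as one tenth of the memory limit. This is only an inference from the logged numbers, not something confirmed against the ExternalSortBatch source:

```python
# Inferred relationship between logged values (a guess from the numbers
# above, not taken from the Drill code).
memory_limit = 126_322_567     # "memory limit" from the Config line
merge_batch_size = 12_632_257  # "merge batch size" from the Config line

# The merge batch size is within a byte of memoryLimit / 10.
print(abs(merge_batch_size - memory_limit / 10) < 1)  # True
```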
This says that the sort leaves a buffer of 126,322,567 - 109,180,038 =
17,142,529 bytes.
The node was given 32 GB of direct memory. Each sort is given 126,322,567
bytes (~126 MB) of memory, for a total of about 2 GB across the 17 slices. So,
the query is running with the default value of max query memory per node, and
is using just 1/16 of the available direct memory.
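A quick arithmetic check of those totals (plain Python; numbers copied from the log):

```python
# Verify the per-node memory math from the log values above.
per_sort = 126_322_567        # memory limit per sort (bytes)
slices = 17                   # minor fragments seen in the log
direct_memory = 32 * 1024**3  # 32 GB of direct memory on the node

total = per_sort * slices
print(total)                   # 2_147_483_639, i.e. just under 2 GiB
print(direct_memory // total)  # 16 -> the query uses ~1/16 of direct memory
```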
Something is amiss with the "record batch sizer":
{code}
ExternalSortBatch - Memory delta: 526336, actual batch size: 365313, Diff:
161023
{code}
This does not cause the fault here, but the "diff" should be 0 if the sizer
does its job.
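The diff is easy to reproduce from the logged numbers. Notably, the memory delta decomposes into two power-of-two buffer sizes, which hints at allocator rounding; that decomposition is a hypothesis, not something verified against the sizer code:

```python
memory_delta = 526_336  # bytes actually consumed, per the allocator
actual_size = 365_313   # bytes computed by the record batch sizer

print(memory_delta - actual_size)  # 161_023, the "Diff" in the log

# Hypothesis only: 526_336 = 512 KiB + 2 KiB, consistent with vector
# buffers being rounded up to power-of-two sizes by the allocator.
print(memory_delta == 512 * 1024 + 2 * 1024)  # True
```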
No log lines suggest that a spill completed. Slice 11 got to the point where it
would spill to disk (we see the code generated for {{PriorityQueueCopierGen56}}).
At that point, the same slice ran out of memory while spilling to disk.
The data file:
{code}
36,951,000,000 3500cols.tbl
{code}
Given that the data file is 37 GB in size, and sort memory is 2 GB (spread over
17 slices), considerable spilling should occur.
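Rough spill arithmetic (illustrative Python; sizes taken from the log lines above):

```python
file_size = 36_951_000_000   # 3500cols.tbl, bytes
slices = 17
buffer_memory = 109_180_038  # in-memory buffer per sort, from the log

per_slice = file_size / slices    # ~2.17 GB of input data per slice
print(per_slice / buffer_memory)  # ~20 -> each slice holds ~1/20 of its
                                  # data in memory, so heavy spilling is
                                  # expected, not an edge case.
```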
It may be that the copier needs more memory than was anticipated, and so the
sort held more rows in memory than it should have before beginning to spill.
However, according to the above, we should have had plenty:
Reserve: 17,142,529
Spill batch size: 8,388,608
Unexpected (sizer diff): 161,023
Net allowance: 8,592,898
which should have been enough memory to complete the spill.
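The allowance figure comes out of simple subtraction (values from the log):

```python
reserve = 126_322_567 - 109_180_038  # memory limit minus buffer memory
spill_batch = 8_388_608              # spill batch size from the Config line
sizer_diff = 161_023                 # unexplained overhead measured above

print(reserve)                             # 17_142_529 bytes reserved
print(reserve - spill_batch - sizer_diff)  # 8_592_898 bytes of headroom
```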
> Managed External Sort throws an OOM during the merge and spill phase
> --------------------------------------------------------------------
>
> Key: DRILL-5294
> URL: https://issues.apache.org/jira/browse/DRILL-5294
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Relational Operators
> Reporter: Rahul Challapalli
> Assignee: Paul Rogers
> Fix For: 1.10.0
>
> Attachments: 2751ce6d-67e6-ae08-3b68-e33b29f9d2a3.sys.drill,
> drillbit.log
>
>
> commit # : 38f816a45924654efd085bf7f1da7d97a4a51e38
> The query below fails with the managed sort while it succeeds with the old sort:
> {code}
> select * from (select columns[433] col433, columns[0],
> columns[1],columns[2],columns[3],columns[4],columns[5],columns[6],columns[7],columns[8],columns[9],columns[10],columns[11]
> from dfs.`/drill/testdata/resource-manager/3500cols.tbl` order by
> columns[450],columns[330],columns[230],columns[220],columns[110],columns[90],columns[80],columns[70],columns[40],columns[10],columns[20],columns[30],columns[40],columns[50])
> d where d.col433 = 'sjka skjf';
> Error: RESOURCE ERROR: External Sort encountered an error while spilling to
> disk
> Fragment 1:11
> [Error Id: 0aa20284-cfcc-450f-89b3-645c280f33a4 on qa-node190.qa.lab:31010]
> (state=,code=0)
> {code}
> Env :
> {code}
> No of Drillbits : 1
> DRILL_MAX_DIRECT_MEMORY="32G"
> DRILL_MAX_HEAP="4G"
> {code}
> Attached the logs and profile. The data is too large to attach to a JIRA.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)