[ 
https://issues.apache.org/jira/browse/DRILL-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879664#comment-15879664
 ] 

Paul Rogers edited comment on DRILL-5294 at 2/23/17 1:38 AM:
-------------------------------------------------------------

Basic statistics:

{code}
ExternalSortBatch - Config: memory limit = 126322567, spill file size = 
268435456, batch size = 8388608, 
    merge limit = 2147483647, merge batch size = 12632257
{code}

The above line appears 17 times in the log, showing that the query has 17 
slices (also known as minor fragments). This can also be seen from the minor 
fragment number in the log line:

{code}
[2751ce6d-67e6-ae08-3b68-e33b29f9d2a3:frag:1:16] ... ExternalSortBatch - Config
{code}
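
A minimal sketch of how the slice count could be recovered from such log 
lines, assuming each line carries a prefix of the form 
[queryId:frag:major:minor] as in the sample above. The function name and the 
generated sample lines are illustrative only; they are not taken from the 
attached drillbit.log.

```python
import re

# Matches the fragment prefix seen in the log sample:
# [<query id>:frag:<major fragment>:<minor fragment>]
FRAG_RE = re.compile(r"\[([0-9a-f-]+):frag:(\d+):(\d+)\]")


def count_minor_fragments(lines, marker="ExternalSortBatch - Config"):
    """Return the set of (major, minor) fragment ids that logged the marker."""
    seen = set()
    for line in lines:
        if marker not in line:
            continue
        m = FRAG_RE.search(line)
        if m:
            seen.add((int(m.group(2)), int(m.group(3))))
    return seen


# Hypothetical log excerpt: one Config line per minor fragment 0..16.
query_id = "2751ce6d-67e6-ae08-3b68-e33b29f9d2a3"
sample = [
    f"[{query_id}:frag:1:{minor}] ... ExternalSortBatch - Config"
    for minor in range(17)
]

fragments = count_minor_fragments(sample)
print(len(fragments))
```

Collecting the ids in a set means a Config line that is logged twice by the 
same fragment is counted once.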

The node was given 32 GB of direct memory. Each sort is given 126,322,567 
bytes (about 126 MB) of memory, for a total of 2 GB across the 17 slices. So, 
the query is running with the default value of max query memory per node, and 
the query is using just 1/16 of available direct memory.
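
The arithmetic behind these figures can be checked with a short sketch. It 
assumes the per-sort limit is a 2 GiB per-node budget divided evenly (with 
integer division) among the 17 sorts; that assumption is inferred from the 
numbers in the log, not confirmed from the Drill source. Variable names are 
illustrative.

```python
from fractions import Fraction

GIB = 1024 ** 3

slices = 17                     # minor fragments seen in the log
per_sort_limit = 126_322_567    # "memory limit" from the Config line, in bytes
direct_memory = 32 * GIB        # DRILL_MAX_DIRECT_MEMORY="32G"

# Memory actually handed to the 17 sorts.
total_sort_memory = slices * per_sort_limit

# Assumption: the per-node budget is 2 GiB, split evenly among the sorts.
assumed_budget = 2 * GIB
matches_even_split = (assumed_budget // slices) == per_sort_limit

# Share of direct memory that the assumed budget represents.
share = Fraction(assumed_budget, direct_memory)

print(total_sort_memory, matches_even_split, share)
```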

Something is amiss with the "record batch sizer":

{code}
ExternalSortBatch - Memory delta: 526336, actual batch size: 365313, Diff: 
161023
{code}

This does not cause the fault here, but the "diff" should be 0 if the sizer 
does its job.
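
To gauge how far off the two figures are, the log entry above can be parsed 
and the gap expressed relative to the "actual batch size" value. This is a 
sketch against the log text exactly as quoted here; the regular expression 
and function name are illustrative, and it makes no claim about which of the 
two numbers is the sizer's estimate.

```python
import re

# Log text as it appears above; the mail wrapped it over two lines.
raw = """ExternalSortBatch - Memory delta: 526336, actual batch size: 365313, Diff: 
161023"""

# \s* also matches the line break introduced by the wrapping.
SIZER_RE = re.compile(
    r"Memory delta:\s*(\d+),\s*actual batch size:\s*(\d+),\s*Diff:\s*(-?\d+)"
)


def parse_sizer_line(text):
    """Return (memory_delta, actual_size, reported_diff) from a sizer log entry."""
    m = SIZER_RE.search(text)
    if m is None:
        raise ValueError("not a sizer log entry")
    return tuple(int(g) for g in m.groups())


delta, actual, diff = parse_sizer_line(raw)

# The reported diff should equal the difference of the two sizes.
consistent = (delta - actual) == diff

# Gap relative to the "actual batch size" figure.
relative_gap = diff / actual

print(consistent, f"{relative_gap:.1%}")
```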

No log lines suggest a spill occurred. Slice 11 got to the point where it 
would spill to disk (we see the code generated for 
{{PriorityQueueCopierGen56}}). At this point, that same slice ran out of 
memory while spilling to disk.

The data file:

{code}
36,951,000,000 3500cols.tbl
{code}

Given that the data file is 37 GB in size, and sort memory is 2 GB (spread over 
17 slices), considerable spilling should occur.
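
A rough estimate of the expected spill pressure, as a sketch. It assumes the 
input is spread evenly across the 17 slices and that every byte of the file 
reaches the sort; since the query projects only some of the columns, the real 
volume may well be smaller, so treat the result as an upper bound.

```python
GIB = 1024 ** 3

file_bytes = 36_951_000_000     # size of 3500cols.tbl from the listing above
slices = 17                     # minor fragments seen in the log
per_sort_limit = 126_322_567    # bytes, from the Config log line

# Assumption: even distribution, and the whole file is sorted.
per_slice_input = file_bytes / slices

# How many times the per-sort memory limit each slice would have to process.
ratio = per_slice_input / per_sort_limit

print(f"{per_slice_input / GIB:.2f} GiB per slice, {ratio:.1f}x the sort limit")
```

Under these assumptions each sort would see well over ten times its memory 
limit, which is consistent with the expectation that spilling should show up 
in the log.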

It may be that the copier needs more memory than was anticipated, and so the 
sort held more rows in memory than it should have before beginning to spill.


was (Author: paul-rogers):
Basic statistics:

{code}
ExternalSortBatch - Config: memory limit = 126322567, spill file size = 
268435456, batch size = 8388608, 
    merge limit = 2147483647, merge batch size = 12632257
{code}

The above line appears 17 times in the log, showing that the query has 17 
slices (also known as minor fragments). This can also be seen from the minor 
fragment number in the log line:

{code}
[2751ce6d-67e6-ae08-3b68-e33b29f9d2a3:frag:1:16] ... ExternalSortBatch - Config
{code}

The node was given 32 GB of direct memory. Each sort is given 126,322,567 
bytes (about 126 MB) of memory, for a total of 2 GB across the 17 slices. So, 
the query is running with the default value of max query memory per node, and 
the query is using just 1/16 of available direct memory.

Something is amiss with the "record batch sizer":

{code}
ExternalSortBatch - Memory delta: 526336, actual batch size: 365313, Diff: 
161023
{code}

This does not cause the fault here, but the "diff" should be 0 if the sizer 
does its job.

No log lines suggest a spill occurred. One of the sorts got to the point where 
it would spill to disk (we see the code generated for 
{{PriorityQueueCopierGen56}}). At this point, some slice ran out of memory 
while spilling to disk.

The data file:

{code}
36,951,000,000 3500cols.tbl
{code}

Given that the data file is 37 GB in size, and sort memory is 2 GB (spread over 
17 slices), considerable spilling should occur.

It may be that the copier needs more memory than was anticipated, and so the 
sort held more rows in memory than it should have before beginning to spill.

> Managed External Sort throws an OOM during the merge and spill phase
> --------------------------------------------------------------------
>
>                 Key: DRILL-5294
>                 URL: https://issues.apache.org/jira/browse/DRILL-5294
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>            Reporter: Rahul Challapalli
>            Assignee: Paul Rogers
>             Fix For: 1.10.0
>
>         Attachments: 2751ce6d-67e6-ae08-3b68-e33b29f9d2a3.sys.drill, 
> drillbit.log
>
>
> commit # : 38f816a45924654efd085bf7f1da7d97a4a51e38
> The below query fails with managed sort while it succeeds on the old sort
> {code}
> select * from (select columns[433] col433, columns[0], 
> columns[1],columns[2],columns[3],columns[4],columns[5],columns[6],columns[7],columns[8],columns[9],columns[10],columns[11]
>  from dfs.`/drill/testdata/resource-manager/3500cols.tbl` order by 
> columns[450],columns[330],columns[230],columns[220],columns[110],columns[90],columns[80],columns[70],columns[40],columns[10],columns[20],columns[30],columns[40],columns[50])
>  d where d.col433 = 'sjka skjf';
> Error: RESOURCE ERROR: External Sort encountered an error while spilling to 
> disk
> Fragment 1:11
> [Error Id: 0aa20284-cfcc-450f-89b3-645c280f33a4 on qa-node190.qa.lab:31010] 
> (state=,code=0)
> {code}
> Env : 
> {code}
> No of Drillbits : 1
> DRILL_MAX_DIRECT_MEMORY="32G"
> DRILL_MAX_HEAP="4G"
> {code}
> Attached the logs and profile. Data is too large for a jira



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
