[ https://issues.apache.org/jira/browse/DRILL-5616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067409#comment-16067409 ]

Boaz Ben-Zvi commented on DRILL-5616:
-------------------------------------

Here is what likely caused the OOM: an internal doubling of the Drill buffers 
used to hold the text key columns.
The incoming data has two text columns and over 2 billion records. The average 
length of each column is about 8 bytes, but somewhere down the data stream 
there is a sizable run of large values (about 250 bytes each).
So the code was running with two partitions, each holding the two text columns. 
The estimated batch size was 12 MB (based on sizing the initial incoming batch, 
then adjusted upward when a larger incoming batch was read).
The need to spill is checked only when a grouping-keys batch fills up. The 
check is: Max_Available_Memory > Current_Memory + 3 * Estimated_Batch + "misc". 
So whenever the check passed, we were at least 36 MB (3 * 12 MB) below the limit.
Then at some point the two partitions (two text columns each), i.e. four 
batches, were partially filled, each under 8 MB. More records were read (no 
memory check happens on that path), and that caused each batch's buffer to be 
doubled. Each doubling adds a net 8 MB (allocate 16 MB, then free the old 
8 MB): 8 + 8 + 8 = 24 MB for the first three buffers. The last buffer then 
allocates its new 16 MB, at which point 24 + 16 = 40 MB, which is above the 
36 MB of headroom the check had guaranteed, hence the OOM.
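
The sequence can be replayed as a small accounting exercise (illustrative 
only, assuming four ~8 MB buffers and the 36 MB headroom guaranteed by the 
last check):

{code}
// Illustrative replay of the doubling sequence (not Drill code).
public class DoublingReplay {
  public static void main(String[] args) {
    final long MB = 1L << 20;
    final long headroom = 36 * MB; // guaranteed by the last memory check
    long extra = 0;                // memory allocated since that check
    for (int buf = 1; buf <= 4; buf++) {
      extra += 16 * MB;            // allocate the doubled (16 MB) buffer first...
      if (extra > headroom) {
        // Buffers 1-3 net +8 MB each (24 MB total); buffer 4 peaks at
        // 24 + 16 = 40 MB, exceeding the 36 MB reserve -> OOM.
        System.out.println("OOM: peak " + extra / MB + " MB > 36 MB headroom");
        return;
      }
      extra -= 8 * MB;             // ...then free the old 8 MB buffer
    }
  }
}
{code}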
Possible solution: at every memory check, just as we plan memory for hash 
table doubling (the "misc" above), also plan for the varchar columns' buffers 
doubling. This requires marking which key columns are varchars, finding a way 
to get their current buffer sizes, and summing them all up. It looks safe for 
the other Hash Agg uses, though it adds some overhead.
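
A rough sketch of what that reservation could look like. The KeyVector 
interface below is a stand-in for Drill's value vectors, not the real API:

{code}
// Hypothetical sketch of the proposed reservation: at each memory check,
// also reserve room to double every variable-width (varchar) buffer.
import java.util.List;

public class VarcharDoublingReserve {
  interface KeyVector {
    boolean isVariableWidth(); // stand-in for an instanceof check on varchar vectors
    long bufferCapacity();     // stand-in for the vector's current buffer size
  }

  // Sum, over all varchar key columns, the extra memory a doubling could
  // transiently need: a buffer of capacity c holds both the old c and the
  // new 2*c until the old one is freed, i.e. up to 2*c beyond current use.
  static long doublingReserve(List<KeyVector> keyColumns) {
    long reserve = 0;
    for (KeyVector v : keyColumns) {
      if (v.isVariableWidth()) {
        reserve += 2 * v.bufferCapacity();
      }
    }
    return reserve;
  }
}
{code}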

> Hash Agg Spill: OOM while reading irregular varchar data
> --------------------------------------------------------
>
>                 Key: DRILL-5616
>                 URL: https://issues.apache.org/jira/browse/DRILL-5616
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.11.0
>            Reporter: Boaz Ben-Zvi
>            Assignee: Boaz Ben-Zvi
>             Fix For: 1.11.0
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> An OOM while aggregating a table of two varchar columns whose sizes vary 
> significantly (about 8 bytes long on average, but up to 250 bytes).
> {code}
> alter session set `planner.width.max_per_node` = 1;
> alter session set `planner.memory.max_query_memory_per_node` = 327127360;
> select count(*) from (select max(`filename`) from
>   dfs.`/drill/testdata/hash-agg/data2` group by no_nulls_col, nulls_col) d;
> {code}
> {code}
> Error: RESOURCE ERROR: One or more nodes ran out of memory while executing 
> the query.
> OOM at Second Phase. Partitions: 2. Estimated batch size: 12255232. Planned 
> batches: 0. Rows spilled so far: 434127447 Memory limit: 163563680 so far 
> allocated: 150601728.
> Fragment 1:0
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
