[jira] [Commented] (DRILL-5235) Column or table alias doubles sort data size when reading a text file

Paul Rogers (JIRA) Sat, 18 Feb 2017 16:49:13 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873381#comment-15873381
 ]


Paul Rogers commented on DRILL-5235:
------------------------------------

Run information:

{code}
Config: memory limit = 2147483648, batch limit = 2147483647, 
  spill file size = 268435456, batch size = 8388608,
  merge limit = 2147483647, merge batch size = 16777216
Input Batch Estimates: record size = 250 bytes; 
  input batch = 4317184 bytes, 8096 records
Merge batch size = 8388608 bytes, 33554 records; 
  spill file size: 268435456 bytes, 1073728 records
Output batch size = 16777216 bytes, 65535 records
Available memory: 2147483648, spill point = 21094400, 
  min. merge memory = 117440512
{code}

Last log line before OOM:

{code}
Spill: Memory = 2126266368, Buffered batch count = 2206, Spill batch count = 133
{code}

The key symptom is that:
{code}
Spill point = 21,094,400
Memory before spill = 2,126,266,368
Memory limit = 2,147,483,648
{code}

The sort somehow allowed more batches into memory beyond the spill point. As a 
result, not enough headroom was left for the 8 MB output batch.
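
As a quick check on the numbers above, the following sketch parses the last spill log line and computes how much of the memory limit was still free at that moment. It is plain arithmetic on the logged values, not Drill code; the names {{MEMORY_LIMIT}}, {{parse_spill_line}} and {{headroom_bytes}} are made up for illustration. Since this is only the last line logged before the OOM, memory use at the moment of failure may have been higher.

```python
import re

# Values copied from the run information and log line above.
MEMORY_LIMIT = 2_147_483_648  # "memory limit" in the Config line
LAST_SPILL_LOG = (
    "Spill: Memory = 2126266368, Buffered batch count = 2206, "
    "Spill batch count = 133"
)

# Hypothetical helper: pull the three integers out of a "Spill:" log line.
SPILL_PATTERN = re.compile(
    r"Memory = (\d+), Buffered batch count = (\d+), Spill batch count = (\d+)"
)


def parse_spill_line(line):
    match = SPILL_PATTERN.search(line)
    if match is None:
        raise ValueError("not a spill log line: " + line)
    memory, buffered, spilled = (int(g) for g in match.groups())
    return {"memory": memory, "buffered": buffered, "spilled": spilled}


def headroom_bytes(limit, in_use):
    # Bytes still free under the limit when the line was logged.
    return limit - in_use


stats = parse_spill_line(LAST_SPILL_LOG)
free = headroom_bytes(MEMORY_LIMIT, stats["memory"])
print(f"in use: {stats['memory']:,} of {MEMORY_LIMIT:,} bytes")
print(f"headroom: {free:,} bytes ({free / 2**20:.1f} MiB)")
```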

> Column or table alias doubles sort data size when reading a text file
> ---------------------------------------------------------------------
>
>                 Key: DRILL-5235
>                 URL: https://issues.apache.org/jira/browse/DRILL-5235
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.9.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Consider a simple query that reads data from a pipe-separated-value file and 
> sorts it. The file has just one column. The query looks something like this:
> {code}
> SELECT columns[0] col1 FROM `dfs.data`.`input-file.tbl` ORDER BY col1
> {code}
> Looking at the query plan, we see that a project operator not only creates an 
> alias {{col1}} for {{columns\[0]}}, it also makes a *copy*.
> The particular input file is 20 GB in size and contains just one column. As a 
> result of materializing the alias, data size to the sort doubles to 40 GB. 
> This doubles query run time. If the sort must spill to disk, run 
> time increases by a much larger factor.
> The fix is to treat the alias as an alias, not a materialized copy.
> {code}
> {
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "columns" : [ "`columns`[0]" ]
>   }, {
>     "pop" : "project",
>     "@id" : 4,
>     "exprs" : [ {
>       "ref" : "`col1`",
>       "expr" : "`columns`[0]"
>     } ]
>   }, {
>     "pop" : "external-sort",
>     "orderings" : [ {
>       "order" : "ASC",
>       "expr" : "`col1`",
>       "nullDirection" : "UNSPECIFIED"
>     } ]
>   }, {
>     "pop" : "selection-vector-remover"
>   }, {
>     "pop" : "project"
>   }, {
>     "pop" : "screen"
>   } ]
> }
> {code}
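
To make the alias-versus-copy distinction in the quoted description concrete, here is a toy model in which a batch is a dictionary of column buffers. This only illustrates the idea and is not how Drill's project operator or value vectors are implemented; every function name below is hypothetical.

```python
def batch_bytes(batch):
    # Total bytes held by the batch, counting each distinct buffer once.
    distinct = {id(buf): len(buf) for buf in batch.values()}
    return sum(distinct.values())


def project_with_copy(batch, source, alias):
    # Materialize the alias: a second buffer with the same contents.
    out = dict(batch)
    out[alias] = bytearray(batch[source])
    return out


def project_with_alias(batch, source, alias):
    # Treat the alias as a rename: the same buffer under a new name.
    out = dict(batch)
    out[alias] = out.pop(source)
    return out


batch = {"columns[0]": bytearray(b"x" * 1000)}
copied = project_with_copy(batch, "columns[0]", "col1")
aliased = project_with_alias(batch, "columns[0]", "col1")

print("input bytes: ", batch_bytes(batch))
print("with copy:   ", batch_bytes(copied))
print("with alias:  ", batch_bytes(aliased))
```

In this toy model the copying projection hands twice as many bytes downstream as the renaming one, which mirrors the 20 GB to 40 GB growth described in the issue.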



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
