[ https://issues.apache.org/jira/browse/DRILL-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873381#comment-15873381 ]
Paul Rogers commented on DRILL-5235:
------------------------------------

Run information:

{code}
Config: memory limit = 2147483648, batch limit = 2147483647, spill file size = 268435456, batch size = 8388608, merge limit = 2147483647, merge batch size = 16777216
Input Batch Estimates: record size = 250 bytes; input batch = 4317184 bytes, 8096 records
Merge batch size = 8388608 bytes, 33554 records; spill file size: 268435456 bytes, 1073728 records
Output batch size = 16777216 bytes, 65535 records
Available memory: 2147483648, spill point = 21094400, min. merge memory = 117440512
{code}

Last log line before OOM:

{code}
Spill: Memory = 2126266368, Buffered batch count = 2206, Spill batch count = 133
{code}

The key symptom is:

{code}
Spill point = 21,094,400
Memory before spill = 2,126,266,368
Memory limit = 2,147,483,648
{code}

The sort somehow allowed more batches into memory beyond the spill point. As a result, not enough headroom was left for the 8 MB output batch.

> Column or table alias doubles sort data size when reading a text file
> ---------------------------------------------------------------------
>
>                 Key: DRILL-5235
>                 URL: https://issues.apache.org/jira/browse/DRILL-5235
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.9.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Consider a simple query that reads data from a pipe-separated-value file and
> sorts it. The file has just one column. The query looks something like this:
> {code}
> SELECT columns[0] col1 FROM `dfs.data`.`input-file.tbl` ORDER BY col1
> {code}
> Looking at the query plan, we see that a project operator not only creates an
> alias {{col1}} for {{columns\[0]}}, but also makes a *copy*.
> The particular input file is 20 GB in size and contains just one column. As a
> result of materializing the alias, data size to the sort doubles to 40 GB.
> This results in doubling query run time. If the sort must spill to disk, run
> time increases by a much larger factor.
> The fix is to treat the alias as an alias, not a materialized copy.
> {code}
> {
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "columns" : [ "`columns`[0]" ],
>   }, {
>     "pop" : "project",
>     "@id" : 4,
>     "exprs" : [ {
>       "ref" : "`col1`",
>       "expr" : "`columns`[0]"
>     } ],
>   }, {
>     "pop" : "external-sort",
>     "orderings" : [ {
>       "order" : "ASC",
>       "expr" : "`col1`",
>       "nullDirection" : "UNSPECIFIED"
>     } ],
>   }, {
>     "pop" : "selection-vector-remover",
>   }, {
>     "pop" : "project",
>   }, {
>     "pop" : "screen",
>   } ]
> }
> {code}
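
As a rough check of the numbers in the comment, the following Python sketch parses the last log line before the OOM and computes how much memory remained under the limit. The helper names ({{parse_spill_line}}, {{headroom}}) are hypothetical and are not part of Drill. The constants are copied from the run information. The sketch only restates the logged values and does not model how Drill actually allocates memory.

```python
import re

# Values copied from the run information and log lines in the comment.
MEMORY_LIMIT = 2_147_483_648
SPILL_POINT = 21_094_400
MIN_MERGE_MEMORY = 117_440_512
LAST_LOG_LINE = (
    "Spill: Memory = 2126266368, Buffered batch count = 2206, "
    "Spill batch count = 133"
)

# Pattern for the three counters in a "Spill: ..." log line.
SPILL_PATTERN = re.compile(
    r"Memory = (\d+), Buffered batch count = (\d+), Spill batch count = (\d+)"
)


def parse_spill_line(line):
    """Extract memory in use and batch counts from a 'Spill:' log line."""
    match = SPILL_PATTERN.search(line)
    if match is None:
        raise ValueError("not a recognized spill log line: " + line)
    memory, buffered, spilled = (int(group) for group in match.groups())
    return {
        "memory": memory,
        "buffered_batches": buffered,
        "spill_batches": spilled,
    }


def headroom(limit, in_use):
    """Bytes still available under the memory limit."""
    return limit - in_use


stats = parse_spill_line(LAST_LOG_LINE)
free = headroom(MEMORY_LIMIT, stats["memory"])

print(f"memory in use    : {stats['memory']:,}")
print(f"headroom         : {free:,}")
print(f"spill point      : {SPILL_POINT:,}")
print(f"min. merge memory: {MIN_MERGE_MEMORY:,}")
print(f"headroom covers min. merge memory: {free >= MIN_MERGE_MEMORY}")
```

With the logged values, the headroom is far smaller than the logged minimum merge memory, which is consistent with the sort having buffered too much before spilling.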