[ https://issues.apache.org/jira/browse/DRILL-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers updated DRILL-5235:
-------------------------------
    Summary: Column alias or reference doubles sort data size when reading a text file  (was: Column alias doubles sort data size when reading a text file)

> Column alias or reference doubles sort data size when reading a text file
> -------------------------------------------------------------------------
>
>                 Key: DRILL-5235
>                 URL: https://issues.apache.org/jira/browse/DRILL-5235
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.9.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Consider a simple query that reads data from a pipe-separated-value file and
> sorts it. The file has just one column. The query looks something like this:
> {code}
> SELECT columns[0] col1 FROM `dfs.data`.`input-file.tbl` ORDER BY col1
> {code}
> Looking at the query plan, we see that a project operator does not just create
> an alias {{col1}} for {{columns\[0]}}, it also makes a *copy* of the column.
> The particular input file is 20 GB in size and contains just one column. As a
> result of materializing the alias, the data size presented to the sort doubles
> to 40 GB. This doubles the query run time. If the sort must spill to disk, run
> time increases by a much larger factor.
> The fix is to treat the alias as an alias, not as a materialized copy.
> Query plan (abbreviated):
> {code}
> {
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "columns" : [ "`columns`[0]" ]
>   }, {
>     "pop" : "project",
>     "@id" : 4,
>     "exprs" : [ {
>       "ref" : "`col1`",
>       "expr" : "`columns`[0]"
>     } ]
>   }, {
>     "pop" : "external-sort",
>     "orderings" : [ {
>       "order" : "ASC",
>       "expr" : "`col1`",
>       "nullDirection" : "UNSPECIFIED"
>     } ]
>   }, {
>     "pop" : "selection-vector-remover"
>   }, {
>     "pop" : "project"
>   }, {
>     "pop" : "screen"
>   } ]
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
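
One rough way to spot the pattern described in this issue is to scan a physical plan for project operators whose expressions only rename a column. The sketch below is Python and is not part of Drill. The function name `find_copying_projects` and the regular expression are illustrative assumptions, and the code assumes the abbreviated plan layout shown in this report (`graph`, `pop`, `exprs`, `ref`, `expr`). It only flags candidates; whether a given rename is actually materialized as a copy depends on Drill's implementation.

```python
import json
import re

# Assumption: a "simple reference" is a back-quoted name with an optional
# array index, such as `col1` or `columns`[0]. Anything else (function
# calls, arithmetic) is treated as a real computation and is not flagged.
SIMPLE_REF = re.compile(r"^`[^`]+`(\[\d+\])?$")


def find_copying_projects(plan):
    """Return (ref, expr) pairs where a project operator only renames a column."""
    hits = []
    for op in plan.get("graph", []):
        if op.get("pop") != "project":
            continue
        for item in op.get("exprs", []):
            ref = item.get("ref", "")
            expr = item.get("expr", "")
            # A rename: the output name differs from a plain input reference.
            if ref != expr and SIMPLE_REF.match(expr):
                hits.append((ref, expr))
    return hits


# Abbreviated plan modeled on the one in this issue.
plan = json.loads("""
{
  "graph": [
    {"pop": "fs-scan", "columns": ["`columns`[0]"]},
    {"pop": "project", "@id": 4,
     "exprs": [{"ref": "`col1`", "expr": "`columns`[0]"}]},
    {"pop": "external-sort"},
    {"pop": "project"}
  ]
}
""")

for ref, expr in find_copying_projects(plan):
    print(f"project renames {expr} to {ref}")
```

For the plan in this issue, the only candidate is the project operator that maps `columns`[0] to `col1`, which is the operator the report identifies as making the copy.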