[
https://issues.apache.org/jira/browse/DRILL-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873806#comment-15873806
]
Paul Rogers commented on DRILL-5235:
------------------------------------
The same problem occurs even without an alias:
{code}
select * from (select * from `dfs.data`.`250wide.tbl` order by columns[0])
where columns[0] = 'ljdfhwuehnoiueyf'
{code}
Batch analysis:
{code}
Actual batch schema & sizes {
T1¦¦columns(std col. size: 54, actual col. size: 255, total size: 2162688,
data size: 2056388, row capacity: 8191, density: 96)
EXPR$1(std col. size: 54, actual col. size: 251, total size: 2138112, data
size: 2032096, row capacity: 8191, density: 96)
Records: 8096, Total size: 4317184, Row width:535, Density:96}
{code}
Execution plan:
{code}
{
"head" : {
...
"type" : "APACHE_DRILL_PHYSICAL",
"options" : [ {
"kind" : "LONG",
"type" : "SESSION",
"name" : "planner.memory.max_query_memory_per_node",
"num_val" : 2147483648
}, {
"kind" : "LONG",
"type" : "SESSION",
"name" : "planner.width.max_per_node",
"num_val" : 1
}, {
"kind" : "BOOLEAN",
"type" : "SESSION",
"name" : "planner.disable_exchanges",
"bool_val" : true
}, {
"kind" : "BOOLEAN",
"type" : "SESSION",
"name" : "planner.enable_hashagg",
"bool_val" : false
} ],
...
},
"graph" : [ {
"pop" : "fs-scan",
"@id" : 9,
....,
"files" : [ "file:/Users/paulrogers/work/data/250wide.tbl" ],
...,
"columns" : [ "`*`" ],
"selectionRoot" : "file:/Users/paulrogers/work/data/250wide.tbl",
"cost" : 1.92801227E8
}, {
"pop" : "project",
"@id" : 8,
"exprs" : [ {
"ref" : "`T0¦¦*`",
"expr" : "`*`"
}, {
"ref" : "`columns`",
"expr" : "`columns`"
} ],
"child" : 9,
...,
"cost" : 1.92801227E8
}, {
"pop" : "project",
"@id" : 7,
"exprs" : [ {
"ref" : "`T0¦¦*`",
"expr" : "`T0¦¦*`"
}, {
"ref" : "`EXPR$1`",
"expr" : "`columns`[0]"
} ],
"child" : 8,
...,
"cost" : 1.92801227E8
}, {
"pop" : "external-sort",
"@id" : 6,
"child" : 7,
"orderings" : [ {
"order" : "ASC",
"expr" : "`EXPR$1`",
"nullDirection" : "UNSPECIFIED"
} ],
"reverse" : false,
"initialAllocation" : 20000000,
"maxAllocation" : 10000000000,
"cost" : 1.92801227E8
}, {
"pop" : "selection-vector-remover",
"@id" : 5,
"child" : 6,
...,
"cost" : 1.92801227E8
}, {
"pop" : "project",
"@id" : 4,
"exprs" : [ {
"ref" : "`T0¦¦*`",
"expr" : "`T0¦¦*`"
} ],
"child" : 5,
...,
"cost" : 1.92801227E8
}, {
"pop" : "filter",
"@id" : 3,
"child" : 4,
"expr" : "equal(`T0¦¦columns`[0], 'ljdfhwuehnoiueyf') ",
...,
"cost" : 2.892018405E7
}, {
"pop" : "selection-vector-remover",
"@id" : 2,
"child" : 3,
...,
"cost" : 2.892018405E7
}, {
"pop" : "project",
"@id" : 1,
"exprs" : [ {
"ref" : "`*`",
"expr" : "`T0¦¦*`"
} ],
"child" : 2,
...,
"cost" : 2.892018405E7
}, {
"pop" : "screen",
"@id" : 0,
"child" : 1,
...,
"cost" : 2.892018405E7
} ]
}
{code}
> Column or table alias doubles sort data size when reading a text file
> ---------------------------------------------------------------------
>
> Key: DRILL-5235
> URL: https://issues.apache.org/jira/browse/DRILL-5235
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.9.0
> Reporter: Paul Rogers
> Priority: Minor
>
> Consider a simple query that reads data from a pipe-separated-value file and
> sorts it. The file has just one column. The query looks something like this:
> {code}
> SELECT columns[0] col1 FROM `dfs.data`.`input-file.tbl` ORDER BY col1
> {code}
> Looking at the query plan, we see that a project operator not just creates an
> alias {{col1}} for {{column\[0]}}, it also makes a *copy*.
> The particular input file is 20 GB in size and contains just one column. As a
> result of materializing the alias, data size to the sort doubles to 40 GB.
> This results in doubling query run time. If the sort must spill to disk, run
> times increases by a much larger factor.
> The fix is to treat the alias as an alias, not a materialized copy.
> {code}
> {
> "graph" : [ {
> "pop" : "fs-scan",
> "columns" : [ "`columns`[0]" ],
> }, {
> "pop" : "project",
> "@id" : 4,
> "exprs" : [ {
> "ref" : "`col1`",
> "expr" : "`columns`[0]"
> } ],
> }, {
> "pop" : "external-sort",
> "orderings" : [ {
> "order" : "ASC",
> "expr" : "`col1`",
> "nullDirection" : "UNSPECIFIED"
> } ],
> }, {
> "pop" : "selection-vector-remover",
> }, {
> "pop" : "project",
> }, {
> "pop" : "screen",
> } ]
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)