[jira] [Commented] (DRILL-5235) Column or table alias doubles sort data size when reading a text file

Paul Rogers (JIRA) Sun, 19 Feb 2017 11:32:04 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873806#comment-15873806
 ]


Paul Rogers commented on DRILL-5235:
------------------------------------

The same problem occurs even without an alias:

{code}
select * from (select * from `dfs.data`.`250wide.tbl` order by columns[0]) 
where columns[0] = 'ljdfhwuehnoiueyf'
{code}

Batch analysis:

{code}
Actual batch schema & sizes {
  T1¦¦columns(std col. size: 54, actual col. size: 255, total size: 2162688, 
data size: 2056388, row capacity: 8191, density: 96)
  EXPR$1(std col. size: 54, actual col. size: 251, total size: 2138112, data 
size: 2032096, row capacity: 8191, density: 96)
  Records: 8096, Total size: 4317184, Row width:535, Density:96}
{code}

Execution plan:

{code}
{
  "head" : {
    ...
    "type" : "APACHE_DRILL_PHYSICAL",
    "options" : [ {
      "kind" : "LONG",
      "type" : "SESSION",
      "name" : "planner.memory.max_query_memory_per_node",
      "num_val" : 2147483648
    }, {
      "kind" : "LONG",
      "type" : "SESSION",
      "name" : "planner.width.max_per_node",
      "num_val" : 1
    }, {
      "kind" : "BOOLEAN",
      "type" : "SESSION",
      "name" : "planner.disable_exchanges",
      "bool_val" : true
    }, {
      "kind" : "BOOLEAN",
      "type" : "SESSION",
      "name" : "planner.enable_hashagg",
      "bool_val" : false
    } ],
    ...
  },
  "graph" : [ {
    "pop" : "fs-scan",
    "@id" : 9,
    ....,
    "files" : [ "file:/Users/paulrogers/work/data/250wide.tbl" ],
   ...,
    "columns" : [ "`*`" ],
    "selectionRoot" : "file:/Users/paulrogers/work/data/250wide.tbl",
    "cost" : 1.92801227E8
  }, {
    "pop" : "project",
    "@id" : 8,
    "exprs" : [ {
      "ref" : "`T0¦¦*`",
      "expr" : "`*`"
    }, {
      "ref" : "`columns`",
      "expr" : "`columns`"
    } ],
    "child" : 9,
    ...,
    "cost" : 1.92801227E8
  }, {
    "pop" : "project",
    "@id" : 7,
    "exprs" : [ {
      "ref" : "`T0¦¦*`",
      "expr" : "`T0¦¦*`"
    }, {
      "ref" : "`EXPR$1`",
      "expr" : "`columns`[0]"
    } ],
    "child" : 8,
    ...,
    "cost" : 1.92801227E8
  }, {
    "pop" : "external-sort",
    "@id" : 6,
    "child" : 7,
    "orderings" : [ {
      "order" : "ASC",
      "expr" : "`EXPR$1`",
      "nullDirection" : "UNSPECIFIED"
    } ],
    "reverse" : false,
    "initialAllocation" : 20000000,
    "maxAllocation" : 10000000000,
    "cost" : 1.92801227E8
  }, {
    "pop" : "selection-vector-remover",
    "@id" : 5,
    "child" : 6,
   ...,
    "cost" : 1.92801227E8
  }, {
    "pop" : "project",
    "@id" : 4,
    "exprs" : [ {
      "ref" : "`T0¦¦*`",
      "expr" : "`T0¦¦*`"
    } ],
    "child" : 5,
    ...,
    "cost" : 1.92801227E8
  }, {
    "pop" : "filter",
    "@id" : 3,
    "child" : 4,
    "expr" : "equal(`T0¦¦columns`[0], 'ljdfhwuehnoiueyf') ",
    ...,
    "cost" : 2.892018405E7
  }, {
    "pop" : "selection-vector-remover",
    "@id" : 2,
    "child" : 3,
    ...,
    "cost" : 2.892018405E7
  }, {
    "pop" : "project",
    "@id" : 1,
    "exprs" : [ {
      "ref" : "`*`",
      "expr" : "`T0¦¦*`"
    } ],
    "child" : 2,
    ...,
    "cost" : 2.892018405E7
  }, {
    "pop" : "screen",
    "@id" : 0,
    "child" : 1,
    ...,
    "cost" : 2.892018405E7
  } ]
}
{code}


> Column or table alias doubles sort data size when reading a text file
> ---------------------------------------------------------------------
>
>                 Key: DRILL-5235
>                 URL: https://issues.apache.org/jira/browse/DRILL-5235
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.9.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Consider a simple query that reads data from a pipe-separated-value file and 
> sorts it. The file has just one column. The query looks something like this:
> {code}
> SELECT columns[0] col1 FROM `dfs.data`.`input-file.tbl` ORDER BY col1
> {code}
> Looking at the query plan, we see that a project operator not just creates an 
> alias {{col1}} for {{column\[0]}}, it also makes a *copy*.
> The particular input file is 20 GB in size and contains just one column. As a 
> result of materializing the alias, data size to the sort doubles to 40 GB. 
> This results in doubling query run time. If the sort must spill to disk, run 
> times increases by a much larger factor.
> The fix is to treat the alias as an alias, not a materialized copy.
> {code}
> {
>   "graph" : [ {
>     "pop" : "fs-scan",
>     "columns" : [ "`columns`[0]" ],
>   }, {
>     "pop" : "project",
>     "@id" : 4,
>     "exprs" : [ {
>       "ref" : "`col1`",
>       "expr" : "`columns`[0]"
>     } ],
>   }, {
>     "pop" : "external-sort",
>     "orderings" : [ {
>       "order" : "ASC",
>       "expr" : "`col1`",
>       "nullDirection" : "UNSPECIFIED"
>     } ],
>   }, {
>     "pop" : "selection-vector-remover",
>   }, {
>     "pop" : "project",
>   }, {
>     "pop" : "screen",
>   } ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (DRILL-5235) Column or table alias doubles sort data size when reading a text file

Reply via email to