[
https://issues.apache.org/jira/browse/DRILL-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers updated DRILL-5235:
-------------------------------
Comment: was deleted
(was: Running on the Mac produces the expected log messages.
I did see unexpected messages about low-density batches from the text reader:
{code}
Saw low density batch. Density: 20
Saw low density batch. Density: 13 (9 times)
Saw low density batch. Density: 24 (10 times)
Saw low density batch. Density: 47 (3 times)
{code}
This query uses the compliant text reader, which suggests something is wrong
there.
Also, the batches are not an optimal size:
{code}
Actual batch schema & sizes {
T1¦¦columns(std col. size: 54, actual col. size: 0, total size: 2162688,
vector size: 0, data size: 0, row capacity: 8191, density: 0)
EXPR$1(std col. size: 54, actual col. size: 250, total size: 2138112, vector
size: 2129920, data size: 2024000, row capacity: 8191, density: 96)
Records: 8096, Total size: 4317184, Row width:535, Density:96}
{code}
That is, the batches are about 4 MB in size with 8096 records each. Our target
should be much larger in both size and record count. This is not a bug, but it
is a potential performance issue, and a separate one from this ticket.
The good news (of a sort) is that the test case reproduces on the Mac -- the
sort did hit an OOM error.)
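The batch figures quoted in the log above can be sanity-checked with a little arithmetic. The following is only a sketch: the constants are copied from the log entry, while the function name and the derived metrics are illustrative and are not Drill's own sizing code.

```python
# Figures copied from the "Actual batch schema & sizes" log entry.
RECORDS = 8096
ROW_CAPACITY = 8191
BATCH_TOTAL_BYTES = 4317184
EXPR1_DATA_BYTES = 2024000  # "data size" reported for EXPR$1


def batch_summary(records, row_capacity, total_bytes, data_bytes):
    """Plain arithmetic on the logged numbers (hypothetical helper)."""
    return {
        "batch_mib": total_bytes / (1024 * 1024),
        "bytes_per_record": total_bytes / records,
        "data_bytes_per_record": data_bytes / records,
        "row_fill_pct": 100.0 * records / row_capacity,
    }


summary = batch_summary(RECORDS, ROW_CAPACITY, BATCH_TOTAL_BYTES, EXPR1_DATA_BYTES)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The data bytes per record come out at 250, which agrees with the logged "actual col. size: 250", and the batch total works out to a little over 4 MiB, consistent with the "about 4 MB" estimate.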
> Column or table alias doubles sort data size when reading a text file
> ---------------------------------------------------------------------
>
> Key: DRILL-5235
> URL: https://issues.apache.org/jira/browse/DRILL-5235
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.9.0
> Reporter: Paul Rogers
> Priority: Minor
>
> Consider a simple query that reads data from a pipe-separated-value file and
> sorts it. The file has just one column. The query looks something like this:
> {code}
> SELECT columns[0] col1 FROM `dfs.data`.`input-file.tbl` ORDER BY col1
> {code}
> Looking at the query plan, we see that a project operator not only creates an
> alias {{col1}} for {{columns\[0]}}, but also makes a *copy* of the data.
> The particular input file is 20 GB in size and contains just one column. As a
> result of materializing the alias, the data size fed to the sort doubles to
> 40 GB. This doubles query run time. If the sort must spill to disk, run time
> increases by a much larger factor.
> The fix is to treat the alias as an alias, not a materialized copy.
> {code}
> {
> "graph" : [ {
> "pop" : "fs-scan",
> "columns" : [ "`columns`[0]" ],
> }, {
> "pop" : "project",
> "@id" : 4,
> "exprs" : [ {
> "ref" : "`col1`",
> "expr" : "`columns`[0]"
> } ],
> }, {
> "pop" : "external-sort",
> "orderings" : [ {
> "order" : "ASC",
> "expr" : "`col1`",
> "nullDirection" : "UNSPECIFIED"
> } ],
> }, {
> "pop" : "selection-vector-remover",
> }, {
> "pop" : "project",
> }, {
> "pop" : "screen",
> } ]
> }
> {code}
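As a rough illustration of the pattern in the plan above, the sketch below scans a plan-like structure for project expressions that only rename a bare column reference, and estimates the sort input size for a single-column file. The plan dictionary is a trimmed, hand-written stand-in for the abbreviated plan in the issue. The regex, the function names, and the size model (each materialized alias adds one full copy of the column) are assumptions made for illustration, not Drill internals.

```python
import re

# Trimmed stand-in for the abbreviated plan in the issue (assumed shape).
PLAN = {
    "graph": [
        {"pop": "fs-scan", "columns": ["`columns`[0]"]},
        {"pop": "project",
         "exprs": [{"ref": "`col1`", "expr": "`columns`[0]"}]},
        {"pop": "external-sort",
         "orderings": [{"order": "ASC", "expr": "`col1`"}]},
    ]
}

# A bare column reference: a back-quoted name with optional array indexes.
COLUMN_REF = re.compile(r"^`[^`]+`(\[\d+\])*$")


def find_pure_aliases(plan):
    """Return (ref, expr) pairs where a project merely renames a column."""
    aliases = []
    for op in plan["graph"]:
        if op.get("pop") != "project":
            continue
        for item in op.get("exprs", []):
            if COLUMN_REF.match(item["expr"]) and item["ref"] != item["expr"]:
                aliases.append((item["ref"], item["expr"]))
    return aliases


def sort_input_gb(scan_gb, alias_count, materialized):
    """Simplified model for a one-column file: each copied alias adds scan_gb."""
    return scan_gb * (1 + alias_count) if materialized else scan_gb


aliases = find_pure_aliases(PLAN)
print(aliases)
print(sort_input_gb(20, len(aliases), materialized=True))   # copy made
print(sort_input_gb(20, len(aliases), materialized=False))  # alias only
```

With the 20 GB single-column file from the description, the model gives 40 GB when the alias is copied and 20 GB when it is treated as a pure alias, which is the difference the proposed fix is after.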
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)