[jira] [Updated] (DRILL-7055) Revise SELECT * to not include partitions

Paul Rogers (JIRA) Sun, 03 Mar 2019 12:58:47 -0800


     [ 
https://issues.apache.org/jira/browse/DRILL-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Paul Rogers updated DRILL-7055:
-------------------------------
    Description: 
Suppose you have a partitioned table:

{noformat}
myTable
  2018
    file1.csv
  2019
    file2.csv
{noformat}

For some time, Drill has included the partitions in a {{SELECT *}} query:

{noformat}
SELECT * FROM `myTable`;
col1.1, col1.2, 2018
col2.1, col2.2, 2019
{noformat}

There are two subtle issues with this behavior:

* The behavior of partitions is not consistent with the other file metadata 
(implicit) columns.
* Because directory depth can vary, and scan order is random, a hard schema 
change can occur if Drill starts scanning shallow files before deep files.

This ticked proposes to change the partition behavior to be like that of other 
file metadata columns: they are included only when requested:

{noformat}
SELECT * FROM `myTable`;
col1.1, col1.2
col2.1, col2.2

SELECT *, dir0, filename FROM `myTable`;
col1.1, col1.2, 2018, file1.csv
col2.1, col2.2, 2019, file2.csv
{noformat}

With this change, there is no possibility of a hard schema change: the user 
predefines the desired partitions.

Unfortunately, with the existing readers, a reader that does not have given 
partition will omit that partition column and will instead leave it to the 
projection operator to fill in the column, which it will do with a Nullable 
INT. (The new row-set based scan mechanism handles this case correctly.)

h4. Technical Background

In the last year, Calcite appears to have added the ability to specify a 
wildcard plus extra columns. When used with implicit columns, we can now say:

{code:sql}
SELECT *, filename FROM myTable;
{code}

However, while the readers (at least the CSV reader) can handle this case, the 
{{ProjectRecordBatch}} cannot.

Modify the {{TestCsv.java}} test case with the following test:

{code:java}
  @Test
  public void testImplicitColWildcard() throws IOException {
    String sql = "SELECT *, filename FROM `dfs.data`.`%s`";
    RowSet actual = client.queryBuilder().sql(sql, CASE2_FILE_NAME).rowSet();
    actual.print();

    TupleMetadata expectedSchema = new SchemaBuilder()
        .add("a", MinorType.VARCHAR)
        .add("b", MinorType.VARCHAR)
        .add("c", MinorType.VARCHAR)
        .addNullable("filename", MinorType.VARCHAR)
        .buildSchema();

    RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
        .addRow("10", "foo", "bar", CASE2_FILE_NAME)
        .build();
    RowSetUtilities.verify(expected, actual);
  }
{code}

The output of the {{actual.print()}} is:

{noformat}
#: a, b, c, filename
0: "10", "foo", "bar", "case2.csv"
{noformat}

Now, try the same thing, but substitute "dir0" for "filename". We would expect 
to see something like the above. What we actually see is:

{noformat}
#: a, b, c, dir0, dir00
0: "10", "foo", "bar", null, null
{noformat}

Note that I'm trying this on a "new" CSV reader that fills in "dir0". To see 
the same thing on the master branch, put the CSV file under a directory and 
query the directory.

The problem is traced to 
[here|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java#L592]:

{code:java}
  private boolean isImplicitFileColumn(ValueVector vvIn) {
    return 
ColumnExplorer.initImplicitFileColumns(context.getOptions()).get(vvIn.getField().getName())
 != null;
  }
{code}

This has two problems:

1. It creates a map of implicit column names, but does not handle parsing names 
like "dir0".
2. It creates the map over and over: once per column per schema change. Very 
inefficient.

The solution is to modify the code to use the {{isPartitionColumn()}} method in 
{{ColumnExplorer}}. Plus, create the {{ColumnExplorer}} once per project 
operator instance and reuse it.

  was:
In the last year, Calcite appears to have added the ability to specify a 
wildcard plus extra columns. When used with implicit columns, we can now say:

{code:sql}
SELECT *, filename FROM myTable;
{code}

However, while the readers (at least the CSV reader) can handle this case, the 
{{ProjectRecordBatch}} cannot.

Modify the {{TestCsv.java}} test case with the following test:

{code:java}
  @Test
  public void testImplicitColWildcard() throws IOException {
    String sql = "SELECT *, filename FROM `dfs.data`.`%s`";
    RowSet actual = client.queryBuilder().sql(sql, CASE2_FILE_NAME).rowSet();
    actual.print();

    TupleMetadata expectedSchema = new SchemaBuilder()
        .add("a", MinorType.VARCHAR)
        .add("b", MinorType.VARCHAR)
        .add("c", MinorType.VARCHAR)
        .addNullable("filename", MinorType.VARCHAR)
        .buildSchema();

    RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
        .addRow("10", "foo", "bar", CASE2_FILE_NAME)
        .build();
    RowSetUtilities.verify(expected, actual);
  }
{code}

The output of the {{actual.print()}} is:

{noformat}
#: a, b, c, filename
0: "10", "foo", "bar", "case2.csv"
{noformat}

Now, try the same thing, but substitute "dir0" for "filename". We would expect 
to see something like the above. What we actually see is:

{noformat}
#: a, b, c, dir0, dir00
0: "10", "foo", "bar", null, null
{noformat}

Note that I'm trying this on a "new" CSV reader that fills in "dir0". To see 
the same thing on the master branch, put the CSV file under a directory and 
query the directory.

The problem is traced to 
[here|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java#L592]:

{code:java}
  private boolean isImplicitFileColumn(ValueVector vvIn) {
    return 
ColumnExplorer.initImplicitFileColumns(context.getOptions()).get(vvIn.getField().getName())
 != null;
  }
{code}

This has two problems:

1. It creates a map of implicit column names, but does not handle parsing names 
like "dir0".
2. It creates the map over and over: once per column per schema change. Very 
inefficient.

The solution is to modify the code to use the {{isPartitionColumn()}} method in 
{{ColumnExplorer}}. Plus, create the {{ColumnExplorer}} once per project 
operator instance and reuse it.


> Revise SELECT * to not include partitions
> -----------------------------------------
>
>                 Key: DRILL-7055
>                 URL: https://issues.apache.org/jira/browse/DRILL-7055
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.15.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>
> Suppose you have a partitioned table:
> {noformat}
> myTable
>   2018
>     file1.csv
>   2019
>     file2.csv
> {noformat}
> For some time, Drill has included the partitions in a {{SELECT *}} query:
> {noformat}
> SELECT * FROM `myTable`;
> col1.1, col1.2, 2018
> col2.1, col2.2, 2019
> {noformat}
> There are two subtle issues with this behavior:
> * The behavior of partitions is not consistent with the other file metadata 
> (implicit) columns.
> * Because directory depth can vary, and scan order is random, a hard schema 
> change can occur if Drill starts scanning shallow files before deep files.
> This ticked proposes to change the partition behavior to be like that of 
> other file metadata columns: they are included only when requested:
> {noformat}
> SELECT * FROM `myTable`;
> col1.1, col1.2
> col2.1, col2.2
> SELECT *, dir0, filename FROM `myTable`;
> col1.1, col1.2, 2018, file1.csv
> col2.1, col2.2, 2019, file2.csv
> {noformat}
> With this change, there is no possibility of a hard schema change: the user 
> predefines the desired partitions.
> Unfortunately, with the existing readers, a reader that does not have given 
> partition will omit that partition column and will instead leave it to the 
> projection operator to fill in the column, which it will do with a Nullable 
> INT. (The new row-set based scan mechanism handles this case correctly.)
> h4. Technical Background
> In the last year, Calcite appears to have added the ability to specify a 
> wildcard plus extra columns. When used with implicit columns, we can now say:
> {code:sql}
> SELECT *, filename FROM myTable;
> {code}
> However, while the readers (at least the CSV reader) can handle this case, 
> the {{ProjectRecordBatch}} cannot.
> Modify the {{TestCsv.java}} test case with the following test:
> {code:java}
>   @Test
>   public void testImplicitColWildcard() throws IOException {
>     String sql = "SELECT *, filename FROM `dfs.data`.`%s`";
>     RowSet actual = client.queryBuilder().sql(sql, CASE2_FILE_NAME).rowSet();
>     actual.print();
>     TupleMetadata expectedSchema = new SchemaBuilder()
>         .add("a", MinorType.VARCHAR)
>         .add("b", MinorType.VARCHAR)
>         .add("c", MinorType.VARCHAR)
>         .addNullable("filename", MinorType.VARCHAR)
>         .buildSchema();
>     RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
>         .addRow("10", "foo", "bar", CASE2_FILE_NAME)
>         .build();
>     RowSetUtilities.verify(expected, actual);
>   }
> {code}
> The output of the {{actual.print()}} is:
> {noformat}
> #: a, b, c, filename
> 0: "10", "foo", "bar", "case2.csv"
> {noformat}
> Now, try the same thing, but substitute "dir0" for "filename". We would 
> expect to see something like the above. What we actually see is:
> {noformat}
> #: a, b, c, dir0, dir00
> 0: "10", "foo", "bar", null, null
> {noformat}
> Note that I'm trying this on a "new" CSV reader that fills in "dir0". To see 
> the same thing on the master branch, put the CSV file under a directory and 
> query the directory.
> The problem is traced to 
> [here|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java#L592]:
> {code:java}
>   private boolean isImplicitFileColumn(ValueVector vvIn) {
>     return 
> ColumnExplorer.initImplicitFileColumns(context.getOptions()).get(vvIn.getField().getName())
>  != null;
>   }
> {code}
> This has two problems:
> 1. It creates a map of implicit column names, but does not handle parsing 
> names like "dir0".
> 2. It creates the map over and over: once per column per schema change. Very 
> inefficient.
> The solution is to modify the code to use the {{isPartitionColumn()}} method 
> in {{ColumnExplorer}}. Plus, create the {{ColumnExplorer}} once per project 
> operator instance and reuse it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (DRILL-7055) Revise SELECT * to not include partitions

Reply via email to