[
https://issues.apache.org/jira/browse/DRILL-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers updated DRILL-7055:
-------------------------------
Description:
Suppose you have a partitioned table:
{noformat}
myTable
2018
file1.csv
2019
file2.csv
{noformat}
For some time, Drill has included the partitions in a {{SELECT *}} query:
{noformat}
SELECT * FROM `myTable`;
col1.1, col1.2, 2018
col2.1, col2.2, 2019
{noformat}
There are two subtle issues with this behavior:
* The behavior of partitions is not consistent with the other file metadata
(implicit) columns.
* Because directory depth can vary, and scan order is random, a hard schema
change can occur if Drill starts scanning shallow files before deep files.
This ticked proposes to change the partition behavior to be like that of other
file metadata columns: they are included only when requested:
{noformat}
SELECT * FROM `myTable`;
col1.1, col1.2
col2.1, col2.2
SELECT *, dir0, filename FROM `myTable`;
col1.1, col1.2, 2018, file1.csv
col2.1, col2.2, 2019, file2.csv
{noformat}
With this change, there is no possibility of a hard schema change: the user
predefines the desired partitions.
Unfortunately, with the existing readers, a reader that does not have given
partition will omit that partition column and will instead leave it to the
projection operator to fill in the column, which it will do with a Nullable
INT. (The new row-set based scan mechanism handles this case correctly.)
h4. Technical Background
In the last year, Calcite appears to have added the ability to specify a
wildcard plus extra columns. When used with implicit columns, we can now say:
{code:sql}
SELECT *, filename FROM myTable;
{code}
However, while the readers (at least the CSV reader) can handle this case, the
{{ProjectRecordBatch}} cannot.
Modify the {{TestCsv.java}} test case with the following test:
{code:java}
@Test
public void testImplicitColWildcard() throws IOException {
String sql = "SELECT *, filename FROM `dfs.data`.`%s`";
RowSet actual = client.queryBuilder().sql(sql, CASE2_FILE_NAME).rowSet();
actual.print();
TupleMetadata expectedSchema = new SchemaBuilder()
.add("a", MinorType.VARCHAR)
.add("b", MinorType.VARCHAR)
.add("c", MinorType.VARCHAR)
.addNullable("filename", MinorType.VARCHAR)
.buildSchema();
RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
.addRow("10", "foo", "bar", CASE2_FILE_NAME)
.build();
RowSetUtilities.verify(expected, actual);
}
{code}
The output of the {{actual.print()}} is:
{noformat}
#: a, b, c, filename
0: "10", "foo", "bar", "case2.csv"
{noformat}
Now, try the same thing, but substitute "dir0" for "filename". We would expect
to see something like the above. What we actually see is:
{noformat}
#: a, b, c, dir0, dir00
0: "10", "foo", "bar", null, null
{noformat}
Note that I'm trying this on a "new" CSV reader that fills in "dir0". To see
the same thing on the master branch, put the CSV file under a directory and
query the directory.
The problem is traced to
[here|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java#L592]:
{code:java}
private boolean isImplicitFileColumn(ValueVector vvIn) {
return
ColumnExplorer.initImplicitFileColumns(context.getOptions()).get(vvIn.getField().getName())
!= null;
}
{code}
This has two problems:
1. It creates a map of implicit column names, but does not handle parsing names
like "dir0".
2. It creates the map over and over: once per column per schema change. Very
inefficient.
The solution is to modify the code to use the {{isPartitionColumn()}} method in
{{ColumnExplorer}}. Plus, create the {{ColumnExplorer}} once per project
operator instance and reuse it.
was:
In the last year, Calcite appears to have added the ability to specify a
wildcard plus extra columns. When used with implicit columns, we can now say:
{code:sql}
SELECT *, filename FROM myTable;
{code}
However, while the readers (at least the CSV reader) can handle this case, the
{{ProjectRecordBatch}} cannot.
Modify the {{TestCsv.java}} test case with the following test:
{code:java}
@Test
public void testImplicitColWildcard() throws IOException {
String sql = "SELECT *, filename FROM `dfs.data`.`%s`";
RowSet actual = client.queryBuilder().sql(sql, CASE2_FILE_NAME).rowSet();
actual.print();
TupleMetadata expectedSchema = new SchemaBuilder()
.add("a", MinorType.VARCHAR)
.add("b", MinorType.VARCHAR)
.add("c", MinorType.VARCHAR)
.addNullable("filename", MinorType.VARCHAR)
.buildSchema();
RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
.addRow("10", "foo", "bar", CASE2_FILE_NAME)
.build();
RowSetUtilities.verify(expected, actual);
}
{code}
The output of the {{actual.print()}} is:
{noformat}
#: a, b, c, filename
0: "10", "foo", "bar", "case2.csv"
{noformat}
Now, try the same thing, but substitute "dir0" for "filename". We would expect
to see something like the above. What we actually see is:
{noformat}
#: a, b, c, dir0, dir00
0: "10", "foo", "bar", null, null
{noformat}
Note that I'm trying this on a "new" CSV reader that fills in "dir0". To see
the same thing on the master branch, put the CSV file under a directory and
query the directory.
The problem is traced to
[here|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java#L592]:
{code:java}
private boolean isImplicitFileColumn(ValueVector vvIn) {
return
ColumnExplorer.initImplicitFileColumns(context.getOptions()).get(vvIn.getField().getName())
!= null;
}
{code}
This has two problems:
1. It creates a map of implicit column names, but does not handle parsing names
like "dir0".
2. It creates the map over and over: once per column per schema change. Very
inefficient.
The solution is to modify the code to use the {{isPartitionColumn()}} method in
{{ColumnExplorer}}. Plus, create the {{ColumnExplorer}} once per project
operator instance and reuse it.
> Revise SELECT * to not include partitions
> -----------------------------------------
>
> Key: DRILL-7055
> URL: https://issues.apache.org/jira/browse/DRILL-7055
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.15.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
>
> Suppose you have a partitioned table:
> {noformat}
> myTable
> 2018
> file1.csv
> 2019
> file2.csv
> {noformat}
> For some time, Drill has included the partitions in a {{SELECT *}} query:
> {noformat}
> SELECT * FROM `myTable`;
> col1.1, col1.2, 2018
> col2.1, col2.2, 2019
> {noformat}
> There are two subtle issues with this behavior:
> * The behavior of partitions is not consistent with the other file metadata
> (implicit) columns.
> * Because directory depth can vary, and scan order is random, a hard schema
> change can occur if Drill starts scanning shallow files before deep files.
> This ticked proposes to change the partition behavior to be like that of
> other file metadata columns: they are included only when requested:
> {noformat}
> SELECT * FROM `myTable`;
> col1.1, col1.2
> col2.1, col2.2
> SELECT *, dir0, filename FROM `myTable`;
> col1.1, col1.2, 2018, file1.csv
> col2.1, col2.2, 2019, file2.csv
> {noformat}
> With this change, there is no possibility of a hard schema change: the user
> predefines the desired partitions.
> Unfortunately, with the existing readers, a reader that does not have given
> partition will omit that partition column and will instead leave it to the
> projection operator to fill in the column, which it will do with a Nullable
> INT. (The new row-set based scan mechanism handles this case correctly.)
> h4. Technical Background
> In the last year, Calcite appears to have added the ability to specify a
> wildcard plus extra columns. When used with implicit columns, we can now say:
> {code:sql}
> SELECT *, filename FROM myTable;
> {code}
> However, while the readers (at least the CSV reader) can handle this case,
> the {{ProjectRecordBatch}} cannot.
> Modify the {{TestCsv.java}} test case with the following test:
> {code:java}
> @Test
> public void testImplicitColWildcard() throws IOException {
> String sql = "SELECT *, filename FROM `dfs.data`.`%s`";
> RowSet actual = client.queryBuilder().sql(sql, CASE2_FILE_NAME).rowSet();
> actual.print();
> TupleMetadata expectedSchema = new SchemaBuilder()
> .add("a", MinorType.VARCHAR)
> .add("b", MinorType.VARCHAR)
> .add("c", MinorType.VARCHAR)
> .addNullable("filename", MinorType.VARCHAR)
> .buildSchema();
> RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
> .addRow("10", "foo", "bar", CASE2_FILE_NAME)
> .build();
> RowSetUtilities.verify(expected, actual);
> }
> {code}
> The output of the {{actual.print()}} is:
> {noformat}
> #: a, b, c, filename
> 0: "10", "foo", "bar", "case2.csv"
> {noformat}
> Now, try the same thing, but substitute "dir0" for "filename". We would
> expect to see something like the above. What we actually see is:
> {noformat}
> #: a, b, c, dir0, dir00
> 0: "10", "foo", "bar", null, null
> {noformat}
> Note that I'm trying this on a "new" CSV reader that fills in "dir0". To see
> the same thing on the master branch, put the CSV file under a directory and
> query the directory.
> The problem is traced to
> [here|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java#L592]:
> {code:java}
> private boolean isImplicitFileColumn(ValueVector vvIn) {
> return
> ColumnExplorer.initImplicitFileColumns(context.getOptions()).get(vvIn.getField().getName())
> != null;
> }
> {code}
> This has two problems:
> 1. It creates a map of implicit column names, but does not handle parsing
> names like "dir0".
> 2. It creates the map over and over: once per column per schema change. Very
> inefficient.
> The solution is to modify the code to use the {{isPartitionColumn()}} method
> in {{ColumnExplorer}}. Plus, create the {{ColumnExplorer}} once per project
> operator instance and reuse it.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)