[
https://issues.apache.org/jira/browse/DRILL-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers updated DRILL-7055:
-------------------------------
Summary: Project operator cannot handle wildcard + partition cols (was:
Project operator cannot handle wildcard + implicit cols)
> Project operator cannot handle wildcard + partition cols
> --------------------------------------------------------
>
> Key: DRILL-7055
> URL: https://issues.apache.org/jira/browse/DRILL-7055
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.15.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
>
> In the last year, Calcite appears to have added the ability to specify a
> wildcard plus extra columns. When used with implicit columns, we can now say:
> {code:sql}
> SELECT *, filename FROM myTable;
> {code}
> However, while the readers (at least the CSV reader) can handle this case,
> the {{ProjectRecordBatch}} cannot.
> Modify the {{TestCsv.java}} test case with the following test:
> {code:java}
> @Test
> public void testImplicitColWildcard() throws IOException {
> String sql = "SELECT *, filename FROM `dfs.data`.`%s`";
> RowSet actual = client.queryBuilder().sql(sql, CASE2_FILE_NAME).rowSet();
> actual.print();
> TupleMetadata expectedSchema = new SchemaBuilder()
> .add("a", MinorType.VARCHAR)
> .add("b", MinorType.VARCHAR)
> .add("c", MinorType.VARCHAR)
> .addNullable("filename", MinorType.VARCHAR)
> .buildSchema();
> RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
> .addRow("10", "foo", "bar", CASE2_FILE_NAME)
> .build();
> RowSetUtilities.verify(expected, actual);
> }
> {code}
> The output of the {{actual.print()}} is:
> {noformat}
> #: a, b, c, filename
> 0: "10", "foo", "bar", "case2.csv"
> {noformat}
> Now, try the same thing, but substitute "dir0" for "filename". We would
> expect to see something like the above. What we actually see is:
> {noformat}
> #: a, b, c, dir0, dir00
> 0: "10", "foo", "bar", null, null
> {noformat}
> Note that I'm trying this on a "new" CSV reader that fills in "dir0". To see
> the same thing on the master branch, put the CSV file under a directory and
> query the directory.
> The problem is traced to
> [here|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java#L592]:
> {code:java}
> private boolean isImplicitFileColumn(ValueVector vvIn) {
> return
> ColumnExplorer.initImplicitFileColumns(context.getOptions()).get(vvIn.getField().getName())
> != null;
> }
> {code}
> This has two problems:
> 1. It creates a map of implicit column names, but does not handle parsing
> names like "dir0".
> 2. It creates the map over and over: once per column per schema change. Very
> inefficient.
> The solution is to modify the code to use the {{isPartitionColumn()}} method
> in {{ColumnExplorer}}. Plus, create the {{ColumnExplorer}} once per project
> operator instance and reuse it.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)