[
https://issues.apache.org/jira/browse/DRILL-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anton Gozhiy reopened DRILL-7083:
---------------------------------
> Wrong data type for explicit partition column beyond file depth
> ---------------------------------------------------------------
>
> Key: DRILL-7083
> URL: https://issues.apache.org/jira/browse/DRILL-7083
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.15.0
> Reporter: Paul Rogers
> Priority: Minor
>
> Consider the simple case in DRILL-7082. That ticket talks about implicit
> partition columns created by the wildcard. Consider a very similar case:
> {code:sql}
> SELECT a, b, c, dir0, dir1 FROM `myTable`
> {code}
> Where {{myTable}} is a directory of CSV files, each with schema {{(a, b, c)}}:
> {noformat}
> myTable
> |- file1.csv
> |- nested
>    |- file2.csv
> {noformat}
> If the query is run in "stock" Drill, the planner will place both files
> within a single scan operator as described in DRILL-7082. The result schema
> will be:
> {noformat}
> (a VARCHAR, b VARCHAR, c VARCHAR, dir0 VARCHAR, dir1 INT)
> {noformat}
> Notice that last column: why is "dir1" a (nullable) INT? The partition
> mechanism only recognizes partitions that actually exist, leaving the Project
> operator to fill in (with a Nullable INT) any partitions that don't exist
> (that is, any directory levels not actually seen by the scan operator).
> Now, using the same trick as in DRILL-7082, try the query:
> {code:sql}
> SELECT a, b, c, dir0 FROM `myTable`
> {code}
> Again, the trick causes Drill to read each file in a separate scan operator
> (simulating what happens when queries run at scale).
> The scan operator for {{file1.csv}} will see no partitions, so it will omit
> "dir0" and the Project operator will helpfully fill in a Nullable INT. The
> scan operator for {{file2.csv}} sees one level of partition, so it sets
> {{dir0}} to {{nested}} as a Nullable VARCHAR.
> What does the client see? Two records: one with "dir0" as a Nullable INT, the
> other as a Nullable VARCHAR. Clients such as JDBC and ODBC see a hard schema
> change between the two records.
> The two cases described above are really two versions of the same issue.
> Clients expect that, if they use the "dir0", "dir1", ... columns, the type
> is always Nullable VARCHAR so that the schema stays consistent across
> batches.
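
The two cases in the description can be modelled with a small standalone sketch. This is not Drill code: the helper names (partition_depth, dir_column_types) and the "consistent" flag are hypothetical, and the model assumes only what the description states, namely that a scan types a dirN column as Nullable VARCHAR when it actually sees that directory level, and that the Project operator fills in any other dirN column as Nullable INT.

```python
# Hypothetical model of the behaviour described in the issue.
# None of these names are Drill APIs.
NULLABLE_INT = "Nullable INT"
NULLABLE_VARCHAR = "Nullable VARCHAR"


def partition_depth(path, root):
    """Number of directory levels between the table root and the file."""
    relative = path[len(root):].strip("/")
    return len(relative.split("/")) - 1


def dir_column_types(paths, root, requested, consistent=False):
    """Type of each requested dirN column for one scan over `paths`.

    consistent=False models the reported behaviour: levels deeper than
    anything the scan saw are filled in by Project as Nullable INT.
    consistent=True models what clients expect: always Nullable VARCHAR.
    """
    deepest = max(partition_depth(p, root) for p in paths)
    types = {}
    for name in requested:
        level = int(name[len("dir"):])
        if consistent or level < deepest:
            types[name] = NULLABLE_VARCHAR
        else:
            types[name] = NULLABLE_INT
    return types


root = "myTable"
files = ["myTable/file1.csv", "myTable/nested/file2.csv"]

# Case 1: both files in a single scan, query asks for dir0 and dir1.
single_scan = dir_column_types(files, root, ["dir0", "dir1"])

# Case 2: one scan per file, query asks for dir0 only.
per_file = [dir_column_types([f], root, ["dir0"]) for f in files]

# Expected behaviour: the same type no matter which scan read the file.
fixed = [dir_column_types([f], root, ["dir0"], consistent=True) for f in files]

print(single_scan)
print(per_file)
print(fixed)
```

In this model the per-file scans disagree on the type of dir0, which is the schema change a JDBC or ODBC client would observe, while the "consistent" variant yields the same type from every scan.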
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)