[ 
https://issues.apache.org/jira/browse/DRILL-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy closed DRILL-7083.
-------------------------------
    Resolution: Fixed

> Wrong data type for explicit partition column beyond file depth
> ---------------------------------------------------------------
>
>                 Key: DRILL-7083
>                 URL: https://issues.apache.org/jira/browse/DRILL-7083
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.15.0
>            Reporter: Paul Rogers
>            Priority: Minor
>             Fix For: 1.17.0
>
>
> Consider the simple case in DRILL-7082. That ticket talks about implicit 
> partition columns created by the wildcard. Consider a very similar case:
> {code:sql}
> SELECT a, b, c, dir0, dir1 FROM `myTable`
> {code}
> Where {{myTable}} is a directory of CSV files, each with schema {{(a, b, c)}}:
> {noformat}
> myTable
> |- file1.csv
> |- nested
>    |- file2.csv
> {noformat}
> If the query is run in "stock" Drill, the planner will place both files 
> within a single scan operator as described in DRILL-7082. The result schema 
> will be:
> {noformat}
> (a VARCHAR, b VARCHAR, c VARCHAR, dir0 VARCHAR, dir1 INT)
> {noformat}
> Notice the last column: why is "dir1" a Nullable INT? The partition 
> mechanism recognizes only the partitions that actually exist. The Project 
> operator then fills in, as a Nullable INT, any partition column that does 
> not exist (that is, any directory level the scan operator never saw).
> Now, using the same trick as in DRILL-7082, try the query:
> {code:sql}
> SELECT a, b, c, dir0 FROM `myTable`
> {code}
> Again, the trick causes Drill to read each file in a separate scan operator, 
> simulating what happens when queries run at scale.
> The scan operator for {{file1.csv}} sees no partitions, so it omits 
> {{dir0}}, and the Project operator helpfully fills it in as a Nullable INT. 
> The scan operator for {{file2.csv}} sees one level of partition, so it sets 
> {{dir0}} to {{nested}} as a Nullable VARCHAR.
> What does the client see? Two records: one with "dir0" as a Nullable INT, the 
> other with "dir0" as a Nullable VARCHAR. Clients such as JDBC and ODBC see a 
> hard schema change between the two records.
> The two cases described above are really two versions of the same issue. 
> Clients expect that, if they use the "dir0", "dir1", ... columns, the type 
> is always Nullable VARCHAR, so that the schema stays consistent across 
> batches.
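
To make the two cases above concrete, here is a minimal Python sketch of the
behaviour the report describes. It is not Drill code: the function names
(partition_dirs, scan_schema), the fill_type parameter, and the type labels
are all hypothetical and exist only for illustration. The sketch assumes each
file is read by its own scan, and that a "dirN" column deeper than the file's
directory depth is filled in with whatever type the caller chooses.

```python
from typing import Dict, List

# Hypothetical type labels for this sketch; not Drill's real type classes.
NULLABLE_INT = "NULLABLE INT"
NULLABLE_VARCHAR = "NULLABLE VARCHAR"
FILE_COLUMNS = ("a", "b", "c")  # CSV schema from the example: (a, b, c)


def partition_dirs(table_root: str, file_path: str) -> List[str]:
    """Return the directory levels between the table root and the file."""
    relative = file_path[len(table_root):].strip("/")
    return relative.split("/")[:-1]  # drop the file name itself


def scan_schema(table_root: str, file_path: str,
                projected: List[str], fill_type: str) -> Dict[str, str]:
    """Schema one scan would report for one file.

    fill_type is the type given to any dirN column that lies beyond the
    file's directory depth (the part the report says is filled in as INT).
    """
    dirs = partition_dirs(table_root, file_path)
    schema = {}
    for col in projected:
        if col in FILE_COLUMNS:
            schema[col] = "VARCHAR"
        elif col.startswith("dir") and col[3:].isdigit():
            level = int(col[3:])
            # A partition level that exists is a VARCHAR; one that does
            # not exist gets the fill type.
            schema[col] = NULLABLE_VARCHAR if level < len(dirs) else fill_type
    return schema


projected = ["a", "b", "c", "dir0"]
files = ["myTable/file1.csv", "myTable/nested/file2.csv"]

# Behaviour described in the report: missing levels become Nullable INT.
reported = [scan_schema("myTable", f, projected, NULLABLE_INT) for f in files]

# Behaviour clients expect: missing levels become Nullable VARCHAR.
expected = [scan_schema("myTable", f, projected, NULLABLE_VARCHAR) for f in files]
```

With the INT fill, the two scans disagree on the type of "dir0", which is the
hard schema change a JDBC or ODBC client would see. With the VARCHAR fill,
both scans report the same schema. The same function also covers the first
case: projecting "dir1" for file2.csv, whose depth is one, yields the fill
type.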



