[ 
https://issues.apache.org/jira/browse/DRILL-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy reopened DRILL-7082:
---------------------------------

> Inconsistent results with implicit partition columns, multi scans
> -----------------------------------------------------------------
>
>                 Key: DRILL-7082
>                 URL: https://issues.apache.org/jira/browse/DRILL-7082
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.15.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> The runtime behavior of implicit partition columns is wildly inconsistent to 
> the point of being unusable. Consider the following query:
> {code:sql}
> SELECT * FROM `myTable`
> {code}
> Where {{myTable}} is a directory of CSV files, each with schema {{(a, b, c)}}:
> {noformat}
> myTable
> |- file1.csv
> |- nested
>    |- file2.csv
> {noformat}
> Our test files are small. It turns out that, even if we write a test that scans a 
> few files, such as the above example, Drill will group all the reads into a 
> single fragment with a single scan operator. When that happens:
> * The partition columns appear before the data columns: (dir0, a, b, c).
> * The partition columns always appear in every row.
> We get the above result because a single scan operator sees both files and 
> knows the right number of partition columns to create for each.
> But, we know that, if two scans each read files at different depths, the 
> "shallower" one won't see as many partition directories as the "deeper" one. 
> To test this, I modified the text reader to accept a new session option that 
> sets the minimum parallelization. I set it to 2 (the same as the number of 
> files). One could probably also see this by creating text files large enough 
> that the Drill parallelizer chooses to create two fragments.
> Then, I ran the above query 10 times. Now, I get these results:
> * Half the time, the first row has only the data columns (a, b, c); the other 
> half of the time, the first row also has a partition column (depending on 
> which file returned data first).
> * Some of the time the partition column appears in the first position (dir0, 
> a, b, c) and some of the time in the last (a, b, c, dir0). (I have no idea 
> why.)
> The result is, from a two-file query, depending on random factors, your first 
> row schema could be:
> * (a, b, c)
> * (dir0, a, b, c)
> * (a, b, c, dir0)
> In many cases, the second row comes with a hard schema change to a different 
> format.
> The above is demonstrated in the (soon to be provided) {{TestPartitionRace}} 
> unit test.
> IMHO, the behavior is basically unusable as any JDBC/ODBC client will see an 
> inconsistent, changing schema. Instead, what a user would expect is:
> * The partition columns are in the same location in every row (preferably at 
> the end, so data columns remain in fixed positions regardless of the number 
> of partition columns.)
> * The same number of columns in every row. This means that all scan operators 
> must use a single uniform partition depth count, preferably set at plan time 
> in the group scan node that has visibility to all the files to scan.
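
The expected behavior in the last two bullets can be sketched outside of Drill. The following is only an illustrative model in Python, not Drill code: the function names are hypothetical, and padding missing partition columns with None (standing in for NULL) is an assumption. It computes one partition depth over all files, as a planner with visibility to every file could, and has every scan append that many dirN columns after the data columns.

```python
from pathlib import PurePosixPath


def partition_dirs(root, file_path):
    """Directory names between the table root and the file itself."""
    rel = PurePosixPath(file_path).relative_to(root)
    return list(rel.parts[:-1])


def max_partition_depth(root, file_paths):
    """Single depth shared by all scans, computed once over every file."""
    return max((len(partition_dirs(root, p)) for p in file_paths), default=0)


def scan_schema(data_columns, depth):
    """Data columns first, then dir0..dir{depth-1} at the end."""
    return list(data_columns) + [f"dir{i}" for i in range(depth)]


def scan_row(root, file_path, data_values, depth):
    """One row from one scan, padded to the shared depth (assumed NULL fill)."""
    dirs = partition_dirs(root, file_path)
    padded = (dirs + [None] * depth)[:depth]
    return list(data_values) + padded
```

With the two files from the example, a depth taken per scan gives (a, b, c) for file1.csv and (a, b, c, dir0) for nested/file2.csv, which is the mismatch described above. Passing the shared depth of 1 to both scans gives (a, b, c, dir0) in each case, with an empty dir0 for file1.csv.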



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
