paul-rogers opened a new pull request #1675: DRILL-7055: Revise SELECT * to 
exclude partitions
URL: https://github.com/apache/drill/pull/1675
 
 
   Historically, a SELECT * (wildcard) query on a partitioned table included 
partition directory names as a set of "dir0", "dir1" columns. When used with 
files at different depths, this can lead to schema change exceptions as some 
readers create, say, "dir0" and "dir1", while others create just "dir0".
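
To make the depth problem concrete, here is a minimal sketch of how per-file partition columns could be derived from directory depth. It is an illustration only, not Drill's reader code; the function name `partition_columns` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def partition_columns(table_root, file_path):
    """Map each directory level between the table root and the file to dirN.

    Hypothetical helper for illustration; not part of Drill.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(table_root))
    # Every path component except the file name is a partition directory.
    return {f"dir{i}": name for i, name in enumerate(relative.parts[:-1])}


# Two files in the same table, stored at different depths (made-up paths).
shallow = partition_columns("/data/sales", "/data/sales/2018/totals.csv")
deep = partition_columns("/data/sales", "/data/sales/2019/Q1/orders.csv")

# The two readers disagree on the column set, which is the schema change.
print(sorted(shallow))
print(sorted(deep))
```

A reader for the shallow file produces only `dir0`, while a reader for the deep file produces `dir0` and `dir1`, so batches from the two files carry different schemas.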
   
   The result is that either 1) things just work, 2) the client gets some 
batches with two partition columns, others with one, or 3) a hard schema change 
occurs as the project operator creates missing columns as nullable int.
   
   This change proposes to include only table columns when using the wildcard 
and to no longer include partition columns. Partition columns will now work the way 
the "implicit" file columns already work, so this change improves consistency.
   
   The partition columns are still available; they can be requested explicitly:
   
   ```
   SELECT *, dir0, dir1 FROM ...
   ```
   
   Both before and after this change, when including the partition columns 
explicitly, the nullable int issue described above will occur. However, this 
change positions us for the revised scan framework that will properly provide 
the partition columns as nullable VARCHAR whether a matching directory exists 
or not.
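
As a rough sketch of the behavior described for the revised scan framework, the following fills every explicitly requested partition column, using a null where the file has no directory at that level. This is an assumption-laden illustration, not the framework's implementation; `project_partitions` and the sample values are hypothetical.

```python
def project_partitions(requested, file_partitions):
    """Return a value for every requested partition column.

    Hypothetical helper: a missing directory level becomes None, standing in
    for a null in a nullable VARCHAR column, instead of the column being
    invented later with a different type.
    """
    return {name: file_partitions.get(name) for name in requested}


requested = ["dir0", "dir1"]

# One file sits one level deep, the other two levels deep (made-up values).
shallow_row = project_partitions(requested, {"dir0": "2018"})
deep_row = project_partitions(requested, {"dir0": "2019", "dir1": "Q1"})

# Both rows now expose the same column names.
print(shallow_row)
print(deep_row)
```

Because every file yields the same set of columns, no downstream operator has to create a missing column with a guessed type.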
   
   This is a potentially breaking change: any user who uses SELECT * and 
expects partition columns (and manages to work around the schema change issues) 
will see different behavior and will have to revise queries to name the 
partition columns explicitly.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
