[ 
https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16403808#comment-16403808
 ] 

Paul Rogers commented on DRILL-6223:
------------------------------------

[~sachouche], schema changes in Drill are complex and we've all had the fun of 
thinking we understand them, only to find out that there are implicit rules in 
Drill that complicate things.

There are two cases to consider here.

First, a {{SELECT a, b, c}} (explicit) projection list. In this case, the 
*reader* is supposed to project only the requested columns (and fill in nulls 
for missing columns). This is pretty simple for top-level columns, but gets 
very tricky for nested columns.

It is important to be aware (for any future readers) that Drill does allow us 
to specify nested columns: {{SELECT m.a, m.b}}. But this does not do what you 
might think: it does not return a map {{m}} with only columns {{a}} and {{b}}. 
Instead, it (possibly helpfully) projects the columns to the top level as 
{{EXPR$1}} and {{EXPR$2}}. There is no effective way to project a map as a map 
while choosing which of its columns to keep; it is all or nothing.

In the explicit case, there is nothing for the Project operator to do; the 
reader will have done all the needed work to produce a schema that matches the 
project list.

Here, we must be aware of the fatal flaw in the "schema-free" idea as Drill 
implements it. We mentioned missing columns; Drill readers assign those the 
type {{Nullable INT}}. Suppose the query is {{SELECT a, b FROM `someday`}}, in 
which file {{p.parquet}} has both columns (as {{VARCHAR}}) but file 
{{q.parquet}} has only {{a}}. Then the reader for {{q}} will add a column 
{{b}} of type {{Nullable INT}}. This will cause a schema change exception 
downstream.

This is a known, fundamental problem. It would be great if {{q}} could say, 
"Hey, I don't have column {{b}}, so I'll give you a vector of unspecified 
type; you supply the type later." But, today, Drill does not work that way.
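
To make the mismatch concrete, here is a much-simplified model of what the two 
readers produce today. This is not Drill's actual reader or value-vector code; 
the class and method names are invented for illustration, and each batch's 
schema is modeled as just a column-name-to-type map:

{code:java}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MissingColumnExample {

  /** Schema a reader returns for one file, given the explicit project list.
   *  Columns missing from the file are filled in as nullable INT, which is
   *  what Drill does today. */
  static Map<String, String> readerSchema(List<String> projectList,
                                          Map<String, String> fileSchema) {
    Map<String, String> result = new LinkedHashMap<>();
    for (String col : projectList) {
      result.put(col, fileSchema.getOrDefault(col, "nullable INT"));
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> projectList = List.of("a", "b");
    Map<String, String> p = Map.of("a", "VARCHAR", "b", "VARCHAR"); // p.parquet
    Map<String, String> q = Map.of("a", "VARCHAR");                 // q.parquet has no b

    Map<String, String> fromP = readerSchema(projectList, p); // b is VARCHAR
    Map<String, String> fromQ = readerSchema(projectList, q); // b is nullable INT

    // The two batches disagree on the type of b; this is what surfaces
    // downstream as a schema change error.
    System.out.println("Schema change? " + !fromP.equals(fromQ)); // prints true
  }
}
{code}

Each reader does something locally reasonable, but without outside information 
they cannot agree on a type for {{b}}.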

OK, that's the first case. Now on to the second, the wildcard: {{SELECT *}}. 
In this case, the reader must return all columns from each file. Back to our 
two files: {{p}} returns {{(a, b)}}, {{q}} returns just {{(a)}}. What does the 
user expect?

Your change says the user expects only {{(a)}} (the change identifies {{b}} as 
dangling). But is that correct?

Historically, Drill has said that we expect schema evolution. In that 
environment, we want both columns, but we want the "dangling" column to be 
filled with nulls where it is missing. That is, the proper result is 
{{(a, b)}}, with the {{b}} column filled with nulls for records that came from 
{{q}} (the file without {{b}}).

The trick, of course, is that {{b}} may have a non-nullable type. (This is the 
issue that [~vitalii] was working on with DRILL-5970 a while back.)

Overall, I don't think we want to remove columns that are "dangling"; the user 
could never query files that have evolved if we do. Instead, we need to find a 
way to include the "dangling" columns, filling in missing values.
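
As a rough sketch of that desired behavior (an illustration only, with 
invented names and a toy map-based row model rather than the real vector-based 
implementation), the wildcard result would be built over the union of the 
per-file schemas, with nulls filled in where a row's source file lacks a 
column:

{code:java}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DanglingColumnExample {

  /** Union of the per-file schemas: every column seen in any record. */
  static Set<String> unionSchema(List<Map<String, Object>> rows) {
    Set<String> columns = new LinkedHashSet<>();
    rows.forEach(row -> columns.addAll(row.keySet()));
    return columns;
  }

  /** Re-emit each record over the union schema, with null for absent columns. */
  static List<Map<String, Object>> nullFill(List<Map<String, Object>> rows,
                                            Set<String> columns) {
    List<Map<String, Object>> out = new ArrayList<>();
    for (Map<String, Object> row : rows) {
      Map<String, Object> filled = new LinkedHashMap<>();
      columns.forEach(col -> filled.put(col, row.get(col))); // null when absent
      out.add(filled);
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Object> fromP = Map.of("a", "p1", "b", "p2"); // record from p.parquet
    Map<String, Object> fromQ = Map.of("a", "q1");            // record from q.parquet
    List<Map<String, Object>> rows = List.of(fromP, fromQ);

    // Both rows now expose columns a and b; the q row carries b = null.
    System.out.println(nullFill(rows, unionSchema(rows)));
  }
}
{code}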

Now, why can I write so much about this? Much of the reader-side logic for 
this is implemented in the "batch size" set of changes that is slowly merging 
into master. The question here is touched on in the 
[write-up|https://github.com/paul-rogers/drill/wiki/BH-Projection-Framework]. 
(This work did not attempt to address downstream behavior; that is an open 
issue. But it does try to do "schema smoothing" so that, if we read files 
{{p}} and {{q}} in the same reader, the reader will fill in the missing column 
{{b}} for file {{q}} using the type from file {{p}}. Obviously, this trick 
only works if the files are read in the same reader, in the order {{p}}, 
{{q}}.)
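
For illustration, a minimal sketch of that smoothing idea, under the same 
assumption that one reader sees {{p}} before {{q}}. The names are made up; 
this is not the actual code from the batch-size work:

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaSmoother {

  // Column types remembered from files this reader has already read.
  private final Map<String, String> priorTypes = new LinkedHashMap<>();

  /** Resolve the output schema for the next file. Columns the file lacks are
   *  given the type remembered from an earlier file, if any. */
  public Map<String, String> resolve(Map<String, String> fileSchema) {
    Map<String, String> resolved = new LinkedHashMap<>(priorTypes);
    resolved.putAll(fileSchema);   // the file's own columns win
    priorTypes.putAll(fileSchema); // remember them for later files
    return resolved;
  }

  public static void main(String[] args) {
    SchemaSmoother smoother = new SchemaSmoother();
    // p.parquet: both columns come from the file itself.
    System.out.println(smoother.resolve(Map.of("a", "VARCHAR", "b", "VARCHAR")));
    // q.parquet: b is filled in as VARCHAR, remembered from p, not nullable INT.
    System.out.println(smoother.resolve(Map.of("a", "VARCHAR")));
  }
}
{code}

The limitation mentioned above is visible here: if {{q}} happened to be read 
first, or by a different minor fragment, there is no remembered type and we 
are back to the {{Nullable INT}} guess.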

Rather than do this as a quiet bug fix, I suggest we write up a design for how 
we actually want Drill to behave. The only real way to get rational behavior 
is to define a schema up front, so that all readers can coordinate, or so that 
the downstream operators know which columns they need to fill in and how to do 
so. That is a big task, one that has been put off for years, but the 
underlying conflicts and ambiguities have not gone away.

For Parquet (only), we have a possible solution. The planner checks all the 
files, I believe (I think that is why planning can be slow). Let the planner 
figure out the union of all the file schemas, using information from, say, 
file {{p}} in our example to know which column to fill in for file {{q}}. Pass 
this information into the physical plan, and let the readers use it to "do the 
right thing." The work mentioned earlier anticipates this solution: it 
provides a schema hint mechanism to fill in missing columns. Right now, it 
only looks at the very faint hints obtained from the project list, but it is 
intended to use full schema info.
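
As a sketch of how such a plan-time hint could be computed (this is an 
assumption about the approach, not existing planner code, and the names are 
invented), the planner would union the per-file Parquet schemas and flag any 
column on which the files disagree:

{code:java}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParquetSchemaHint {

  /** Merge the per-file schemas into one hint, marking columns whose type
   *  differs between files so the planner (or user) can resolve the conflict. */
  static Map<String, String> schemaHint(List<Map<String, String>> fileSchemas) {
    Map<String, String> hint = new LinkedHashMap<>();
    for (Map<String, String> schema : fileSchemas) {
      schema.forEach((col, type) ->
          hint.merge(col, type, (seen, t) -> seen.equals(t) ? seen : "CONFLICT"));
    }
    return hint;
  }

  public static void main(String[] args) {
    Map<String, String> p = Map.of("a", "VARCHAR", "b", "VARCHAR"); // from p's footer
    Map<String, String> q = Map.of("a", "VARCHAR");                 // from q's footer
    // The hint tells q's reader that its missing column b should be VARCHAR.
    System.out.println(schemaHint(List.of(p, q)));
  }
}
{code}

The resulting hint is what would let the reader for {{q}} give the missing 
{{b}} the {{VARCHAR}} type it has in {{p}}, rather than guessing 
{{Nullable INT}}.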


> Drill fails on Schema changes 
> ------------------------------
>
>                 Key: DRILL-6223
>                 URL: https://issues.apache.org/jira/browse/DRILL-6223
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.10.0, 1.12.0
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.14.0
>
>
> Drill Query Failing when selecting all columns from a Complex Nested Data 
> File (Parquet) Set. There are differences in Schema among the files:
>  * The Parquet files exhibit differences both at the first level and within 
> nested data types
>  * A select * will not cause an exception but using a limit clause will
>  * Note also this issue seems to happen only when multiple Drillbit minor 
> fragments are involved (concurrency higher than one)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
