[ https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16403808#comment-16403808 ]
Paul Rogers commented on DRILL-6223:
------------------------------------

[~sachouche], schema changes in Drill are complex, and we've all had the fun of thinking we understand them, only to find out that there are implicit rules in Drill that complicate things. There are two cases to consider here.

First, a {{SELECT a, b, c}} (explicit) projection list. In this case, the *reader* is supposed to project only the requested columns (and fill in nulls for missing columns). This is pretty simple for top-level columns, but gets very tricky for nested columns. It is important to be aware (for any future readers) that Drill does allow us to specify nested columns: {{SELECT m.a, m.b}}. But this does not do what you might think: it does not return a map {{m}} with only columns {{a}} and {{b}}. Instead, it (possibly helpfully) projects the columns to the top level as {{EXPR$1}} and {{EXPR$2}}. There is no effective way to project a map as a map while choosing which of its columns to keep; it is all or nothing.

In the explicit case, there is nothing for the Project operator to do: the reader will have done all the work needed to produce a schema that matches the project list. Here, we must be aware of the fatal flaw in the "schema-free" idea as Drill implements it. We mentioned missing columns; Drill readers assign the type {{Nullable INT}} to those. Suppose the query is {{SELECT a, b FROM `someday`}}, in which file {{p.parquet}} has both columns (as {{VARCHAR}}) but file {{q.parquet}} has only {{a}}. Then the reader for {{q}} will add a column {{b}} of type {{Nullable INT}}. This will cause a schema change exception downstream. This is a known, fundamental problem. It would be great if {{q}} could say, "Hey, I don't have column {{b}}, so I'll give you a vector of unspecified type; you supply the type later." But today, Drill does not work that way.

OK, that's the first case. Now on to the second: the wildcard, {{SELECT *}}. In this case, the reader must return all columns from each file.
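To make the explicit-projection problem concrete, here is a toy sketch (hypothetical names, not Drill's actual classes) of how a reader resolves a project list against a file's schema, with missing columns defaulting to {{Nullable INT}}, and why two files then produce mismatched batch schemas:

```java
import java.util.*;

/** Toy model (not Drill's real code) of explicit-projection resolution. */
public class ProjectionDemo {
    /** Resolve the projected columns against one file's schema; a column
     *  the file lacks gets Drill's default guess, Nullable INT. */
    static LinkedHashMap<String, String> resolve(List<String> projection,
                                                 Map<String, String> fileSchema) {
        LinkedHashMap<String, String> out = new LinkedHashMap<>();
        for (String col : projection) {
            out.put(col, fileSchema.getOrDefault(col, "NULLABLE_INT"));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> projection = List.of("a", "b");
        Map<String, String> p = Map.of("a", "VARCHAR", "b", "VARCHAR"); // p.parquet
        Map<String, String> q = Map.of("a", "VARCHAR");                 // q.parquet

        Map<String, String> batchFromP = resolve(projection, p);
        Map<String, String> batchFromQ = resolve(projection, q);

        // Downstream operators compare incoming batch schemas; a mismatch
        // is a "schema change" and can raise an exception.
        System.out.println("p batch: " + batchFromP);
        System.out.println("q batch: " + batchFromQ);
        System.out.println("schema change? " + !batchFromP.equals(batchFromQ));
    }
}
```

The point of the sketch is the {{getOrDefault}} line: the reader for {{q}} has no information about {{b}}, so it guesses, and the guess ({{Nullable INT}}) conflicts with the {{VARCHAR}} that {{p}} produced.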
Back to our two files: {{p}} returns {{(a, b)}}, {{q}} returns just {{(a)}}. What does the user expect? Your change says the user expects only {{(a)}} (the change identifies {{b}} as dangling). But is that correct? Historically, Drill says that we expect schema evolution. In that environment, we want both columns, but we want the "dangling" columns to be filled with nulls where they are missing. That is, the proper result is {{(a, b)}}, with the {{b}} column filled with nulls for records that came from {{q}} (the file without {{b}}). The trick, of course, is that {{b}} may have a non-nullable type. (This is the issue that [~vitalii] was working on in DRILL-5970 a while back.)

Overall, I don't think we want to remove columns that are "dangling"; the user could never query files that have evolved if we did. Instead, we need to find a way to include the "dangling" columns, filling in missing values.

Now, why can I write so much about this? Much of the reader-side logic for this is implemented in the "batch size" set of changes that is slowly merging into master. The question here is touched on in the [write-up|https://github.com/paul-rogers/drill/wiki/BH-Projection-Framework]. (This work did not attempt to address downstream behavior; that is an open issue. But it does try to do "schema smoothing," so that if we read files {{p}} and {{q}} in the same reader, the reader will fill in the missing column {{b}} for file {{q}} using the type from file {{p}}. Obviously, this trick only works if the files are read by the same reader, in the order {{p}}, {{q}}.)

Rather than do this as a quiet bug fix, I suggest we write up a design for how we actually want Drill to behave. The only real way to get rational behavior is to define a schema up front so that all readers can coordinate, or so that downstream operators know that they need to fill in columns and how to do so.
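The "schema smoothing" idea can be sketched as follows. This is a toy model under my own hypothetical names (not the classes in the batch-size work): the reader remembers the types seen in earlier files and reuses them for columns that later files lack, instead of defaulting to {{Nullable INT}}:

```java
import java.util.*;

/** Toy sketch of schema smoothing across files read by one reader. */
public class SchemaSmoother {
    // Column types observed in files read so far, in first-seen order.
    private final LinkedHashMap<String, String> priorTypes = new LinkedHashMap<>();

    /** Wildcard resolution for one file: take the file's columns as-is,
     *  then add any previously seen column the file lacks, using the
     *  remembered type (the real reader would null-fill its values). */
    LinkedHashMap<String, String> smooth(LinkedHashMap<String, String> fileSchema) {
        LinkedHashMap<String, String> out = new LinkedHashMap<>(fileSchema);
        for (Map.Entry<String, String> prior : priorTypes.entrySet()) {
            out.putIfAbsent(prior.getKey(), prior.getValue());
        }
        priorTypes.putAll(fileSchema); // remember what this file taught us
        return out;
    }

    public static void main(String[] args) {
        SchemaSmoother reader = new SchemaSmoother();
        LinkedHashMap<String, String> p = new LinkedHashMap<>();
        p.put("a", "VARCHAR"); p.put("b", "VARCHAR"); // p.parquet
        LinkedHashMap<String, String> q = new LinkedHashMap<>();
        q.put("a", "VARCHAR");                        // q.parquet

        System.out.println(reader.smooth(p)); // {a=VARCHAR, b=VARCHAR}
        System.out.println(reader.smooth(q)); // {a=VARCHAR, b=VARCHAR}
    }
}
```

Note the order dependence the comment above describes: if {{q}} were read first, {{priorTypes}} would be empty and {{b}} would still be missing from {{q}}'s batch, which is why a planner-supplied schema (the union of all file schemas, passed into the physical plan) is the more general fix.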
That is a big task, one that has been put off for years, but the underlying conflicts and ambiguities have not gone away.

For Parquet (only), we have a possible solution. The planner already checks all the files (I believe that is why planning can be slow). Let the planner figure out the union of all the files' schemas, using information from, say, file {{p}} in our example to know what column to fill in for file {{q}}. Pass this information into the physical plan, and let the readers use it to "do the right thing." The work mentioned earlier anticipates this solution: it provides a schema hint mechanism to fill in missing columns. Right now, it only looks at the very faint hints obtained from the project list, but it is intended to use full schema info.

> Drill fails on Schema changes
> ------------------------------
>
>                 Key: DRILL-6223
>                 URL: https://issues.apache.org/jira/browse/DRILL-6223
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.10.0, 1.12.0
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.14.0
>
> Drill query failing when selecting all columns from a complex nested data file (Parquet) set. There are differences in schema among the files:
> * The Parquet files exhibit differences both at the first level and within nested data types
> * A select * will not cause an exception, but using a limit clause will
> * Note also this issue seems to happen only when multiple Drillbit minor fragments are involved (concurrency higher than one)

-- This message was sent by Atlassian JIRA (v7.6.3#76005)