[ https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404137#comment-16404137 ]
Paul Rogers commented on DRILL-6223: ------------------------------------

Been reflecting on [~parthc]'s comment. The challenge here is that, as we like to joke, "if something seems simple in Drill, it is only because we are missing something." More seriously, Drill combines relational theory, schema-less files, and a distributed in-memory system in ways that lead to high complexity and, at times, outright contradictions. Nowhere is that more clear than in schema handling.

To Parth's point, consider our example of two files in a directory {{d = [p(a, b), q(a)]}}. Suppose I do the following:
{noformat}
SELECT a, b FROM d GROUP BY b
{noformat}
Today, one of two things will happen. If {{b}} happens to be a {{nullable INT}}, then the result will come back with all nulls (including all rows from {{q}}) in one group, along with other groups for the {{b}} values from {{p}}. This is what we expect.

Now suppose, today, we do:
{noformat}
SELECT * FROM d GROUP BY b
{noformat}
Today, we get the same result as the first case. (Or, I hope we do; I have not actually tried this...) The reason is that the wildcard should expand to {{a, b}}, resulting in the same query. Of course, if {{b}} turns out to be a {{VARCHAR}}, then everything fails (schema exception), because Drill can't group by variable types. (This is not a Drill restriction; relational theory is based on fixed domains for each column; no one has defined rules for what to do if a single column comes from multiple domains.)

Technically, the current behavior for the wildcard is to produce the *union* of columns from the input files, giving the result we want above. The proposal appears to change that behavior to the *intersection* of columns. As [~parthc] notes, this will break queries. Run the explicit select query above with the proposed change: the result will be the same as today. Run the wildcard query: now the behavior is unpredictable.
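To make the union-vs-intersection difference concrete, here is a hypothetical sketch (the file contents and result layouts are invented for illustration; I have not run this):
{noformat}
-- d contains p(a, b) and q(a)
-- p: (1, 'x'), (2, 'y')          q: (3)

SELECT * FROM d

-- Union of columns (today's behavior): b is null-filled for rows from q
--   a | b
--   1 | x
--   2 | y
--   3 | null

-- Intersection of columns (proposed): b disappears entirely
--   a
--   1
--   2
--   3
{noformat}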
If we do a two-phase grouping, we will group the rows from {{p}} by {{b}}. Then we group the rows from {{q}}, notice that {{b}} does not exist, and either fail the query or generate the {{nullable INT}} column and group by a set of null values. Once the results are combined, {{b}} exists in both input paths and so there is no dangling column. We get the same results as today, which is good.

But suppose we do the grouping *after* merging results. In this case, the dangling-column rule kicks in: we remove column {{b}}, but then we recreate it in the group operator, resulting in all data from column {{b}} displaying as nulls. Overall, I think this change would end up causing a never-ending stream of bug reports as columns sometimes disappear and sometimes don't.

Further, it is not clear that the fix even makes the schema change go away. Suppose we use the wildcard query and do partial grouping in the fragment with the scan, as explained above. Since the grouping must create a missing {{b}} (and the distributed groups can't communicate to negotiate which columns to drop), we still get a schema change error if {{p}} has column {{b}} as {{VARCHAR}} but the grouping operator for {{q}} introduces a {{nullable INT}}.

So, overall, the goal of eliminating schema changes is a good one; no user wants their query to fail. We've spent years trying to work out ad-hoc solutions that can be applied to each operator. But local fixes can never solve a global problem (all scanners must agree on column number and type). Hence, I've slowly come to realize that having a global schema is the only general solution.

There is another possible solution that I've been meaning to try. Modify the query to include a filter that creates a new column {{b1}} defined as something like "if the type of {{b}} is {{nullable VARCHAR}}, then use the value of {{b}}, else use a {{NULL}} of type {{VARCHAR}}." There are some test queries of this type for the {{UNION}} type.
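As a sketch of that filter idea (untested; it assumes Drill's {{typeof()}} function behaves as needed here, and the column names are from the running example), the rewrite might look something like:
{noformat}
SELECT a,
       CASE WHEN typeof(b) = 'VARCHAR'
            THEN b
            ELSE CAST(NULL AS VARCHAR) END AS b1
FROM d
{noformat}
A view wrapping this expression would spare users from repeating it in every query, at the cost of the extra per-row evaluation.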
The unfortunate bit is that this code must be inserted into every query, or the user must define a view, and the extra calculations slow the query. Still, it may be the best solution we have at present.

Separately, your PR mentions complex columns (I assume maps and map arrays). As mentioned earlier, Drill cannot do the above calculations for nested columns, so the only solution would be, as part of the view, to project all nested columns up to the top level, which completely destroys the ability to do lateral joins or flattens. This area does, indeed, need some serious design thought.

> Drill fails on Schema changes
> ------------------------------
>
> Key: DRILL-6223
> URL: https://issues.apache.org/jira/browse/DRILL-6223
> Project: Apache Drill
> Issue Type: Improvement
> Components: Execution - Relational Operators
> Affects Versions: 1.10.0, 1.12.0
> Reporter: salim achouche
> Assignee: salim achouche
> Priority: Major
> Fix For: 1.14.0
>
> Drill query failing when selecting all columns from a complex nested data (Parquet) file set. There are differences in schema among the files:
> * The Parquet files exhibit differences both at the first level and within nested data types
> * A {{select *}} will not cause an exception, but using a limit clause will
> * Note also this issue seems to happen only when multiple Drillbit minor fragments are involved (concurrency higher than one)

-- This message was sent by Atlassian JIRA (v7.6.3#76005)