[ https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404137#comment-16404137 ]

Paul Rogers commented on DRILL-6223:
------------------------------------

Been reflecting on [~parthc]'s comment. The challenge here is that, as we like 
to joke, "if something seems simple in Drill, it is only because we are missing 
something." More seriously, Drill combines relational theory, schema-less 
files, and a distributed, in-memory system in ways that lead to high complexity 
and, at times, outright contradictions. Nowhere is that more clear than in 
schema handling.

To Parth's point, consider our example of two files in a directory {{d = [p(a, 
b), q(a)]}}. Suppose I do the following:

{noformat}
SELECT a, b FROM d GROUP BY b
{noformat}

Today, one of two things will happen. If {{b}} happens to be a {{nullable 
INT}}, then the result will come back with all nulls (including all rows from 
{{q}}) in one group, along with other groups for the {{b}} values from {{p}}. 
This is what we expect.
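
For concreteness, the layout {{d = [p(a, b), q(a)]}} corresponds to something 
like the following (hypothetical JSON contents, purely to make the two schemas 
visible):

{noformat}
d/p.json:   {"a": 1, "b": 10}
            {"a": 2, "b": 20}
d/q.json:   {"a": 3}
{noformat}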

Now, suppose, today, we do:

{noformat}
SELECT * FROM d GROUP BY b
{noformat}

Today, we get the same result as the first case. (Or, I hope we do; I have not 
actually tried this...) The reason is that the wildcard should expand to {{a, 
b}}, resulting in the same query.

Of course, if {{b}} in {{p}} turns out to be a {{VARCHAR}}, then everything 
fails (schema exception), because Drill can't group a column whose type varies 
across files. (This is not a 
Drill restriction; relational theory is based on fixed domains for each column; 
no one has defined rules for what to do if a single column comes from multiple 
domains.)

Technically, the current behavior for the wildcard is to produce the *union* of 
columns from the input files, giving the result we want above. The change seems 
to propose switching the behavior to the *intersection* of columns. As [~parthc] 
notes, this will break queries.

Run the explicit select query above with the proposed change. The result will 
be the same as today. Run the wildcard query. Now, the behavior is 
unpredictable.

If we do a two-phase grouping, we will group the rows from {{p}} by {{b}}. Then 
we group the rows from {{q}}, notice that {{b}} does not exist, and either fail 
the query or generate the {{nullable INT}} column and group by a set of null 
values. Once the results are combined, {{b}} exists in both input paths and so 
there is no dangling column. We get the same results 
as today, which is good.

But, suppose we do the grouping *after* merging results. In this case, the 
dangling column rule kicks in and we remove column {{b}}, only to recreate it 
in the group operator, so all of the data for column {{b}} comes back as nulls.

Overall, I think this change would end up causing a never-ending stream of bug 
reports as columns sometimes disappear and sometimes don't.

Further, it is not clear that the fix even makes the schema change go away. 
Suppose we use the wildcard query and do partial grouping in the fragment with 
the scan, as explained above. Since the grouping must create a missing {{b}} 
(and the distributed groups can't communicate to negotiate which columns to 
drop), we still get a schema change error if {{p}} has column {{b}} as 
{{VARCHAR}} but the grouping operator for {{q}} introduces a {{nullable INT}}.

So, overall, the goal of eliminating schema changes is a good one; no user wants 
their query to fail. We've spent years trying to work out ad-hoc solutions that 
can be applied to each operator. But local fixes can never solve a global 
problem (all scanners must agree on column count and types). Hence, I've slowly 
come to realize that having a global schema is the only general solution.

There is another possible solution that I've been meaning to try. Modify the 
query to include an expression that creates a new column {{b1}}, defined as 
something like "if the type of {{b}} is {{nullable VARCHAR}}, then use the 
value of {{b}}, else use a {{NULL}} of type {{VARCHAR}}." There are some test 
queries of this type for the {{UNION}} type. The unfortunate bit is that this 
expression must be inserted into every query, or the user must define a view, 
and the extra calculations slow the query. Still, it may be the best solution 
we have at present.
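
A minimal sketch of that kind of guarded projection, assuming Drill's 
{{typeof()}} function; the {{b1}} alias and the exact type-name comparison are 
illustrative rather than from the PR, and I have not run this:

{noformat}
SELECT a,
       -- sketch only: substitute a typed NULL when b does not arrive as VARCHAR
       CASE WHEN typeof(b) = 'VARCHAR'
            THEN b
            ELSE CAST(NULL AS VARCHAR)
       END AS b1
FROM d
{noformat}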

That said, your PR mentions complex columns (I assume maps and map arrays). As 
mentioned earlier, Drill cannot do the above calculation for nested columns, so 
the only solution would be, as part of the view, to project all nested columns 
up to the top level, which completely destroys the ability to do lateral joins 
or flattens.
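
For the simple map case, a hedged sketch of what such a view would have to do, 
with a made-up workspace, view name, and map column ({{dfs.tmp}}, {{d_flat}}, 
and {{m}} with fields {{x}} and {{y}}):

{noformat}
-- sketch only: the workspace, view name, and map fields are hypothetical
CREATE VIEW dfs.tmp.d_flat AS
SELECT t.a   AS a,
       t.m.x AS m_x,  -- every nested field is hoisted to the top level
       t.m.y AS m_y
FROM d t
{noformat}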

This area does, indeed, need some serious design thought.

> Drill fails on Schema changes 
> ------------------------------
>
>                 Key: DRILL-6223
>                 URL: https://issues.apache.org/jira/browse/DRILL-6223
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.10.0, 1.12.0
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.14.0
>
>
> Drill query failing when selecting all columns from a complex nested (Parquet) 
> data set. There are differences in schema among the files:
>  * The Parquet files exhibit differences both at the first level and within 
> nested data types
>  * A select * will not cause an exception but using a limit clause will
>  * Note also this issue seems to happen only when multiple Drillbit minor 
> fragments are involved (concurrency higher than one)


