[jira] [Commented] (DRILL-6297) Define the Schema Change support functionality

Paul Rogers (JIRA) Wed, 28 Mar 2018 11:50:38 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-6297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417918#comment-16417918
 ]


Paul Rogers commented on DRILL-6297:
------------------------------------

Have not seen the proposal, obviously. To save another rehash of the same old 
issues, it is perhaps worth pointing out that there are many Jira tickets on 
this issue that should be reviewed. In particular, see DRILL-6035 as one of 
many examples.

Drill has long embraced the idea of a schema free system: input files need not 
agree on the set or type of columns, or on column names. The implicit idea 
(never actually documented or designed) is that, somehow, Drill operators will 
"do the right thing" to figure out that, say, columns "foo" and "fooBar" are 
two versions of the same column, that "Varchar" and "Int" types for the same 
column can be reconciled, or that even Int and Repeated Int can somehow be 
merged.

Add to that the idea that a missing column is typed as Nullable Int, when it 
might be typed as Map where it actually appears, and we have a 
nearly-intractable problem.

At the core is that there is no right answer: the problem is inherently 
ambiguous. If I have Int and Varchar, should the Int be converted to Varchar or 
vice versa? How do I know the type of a column I've never seen?
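
To make the ambiguity concrete, here is a minimal sketch. It is a toy model,
not Drill code: the dict-based "schema" and the find_conflicts helper are
made up for illustration. It only shows that two files can disagree in ways
that have no single correct resolution.

```python
# Hypothetical model: each file "schema" is a dict of column name -> type name.
# None of these names come from Drill's code base.

def find_conflicts(schema_a, schema_b):
    """Return columns whose presence or type differs between two schemas."""
    conflicts = {}
    for name in sorted(set(schema_a) | set(schema_b)):
        type_a = schema_a.get(name)   # None means the file lacks the column
        type_b = schema_b.get(name)
        if type_a != type_b:
            conflicts[name] = (type_a, type_b)
    return conflicts


file_1 = {"id": "INT", "amount": "INT", "tags": "INT"}
file_2 = {"id": "INT", "amount": "VARCHAR", "tags": "REPEATED INT", "extra": "MAP"}

conflicts = find_conflicts(file_1, file_2)
for column, (left, right) in conflicts.items():
    print(column, left, right)
```

The helper can detect each mismatch, but nothing in the data says which side
should win. That choice has to come from outside the files.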

It is not clear that this chaos actually provides any value whatsoever to the 
user. Suppose we decide on rules that don't fit the application. Now the user 
has to work around seemingly-arbitrary rules.

At the same time, Drill is SQL-based. SQL gets its power from working with 
relations over collections of domains. (Tables of columns with known, fixed 
types.) It is not clear that standard SQL can be extended to deal with columns 
of varying types, or tables with varying columns. (There is an extension, SQL++, 
that proposes to do some of this; but Drill's claim to fame is that it supports 
standard SQL of the type produced by Tableau, etc.)

Drill does provide a partial solution: use lots of conditional statements and 
casts in each query to force the type conversions. These are often moved into 
views. Still, this is, at best, a poor-man's metadata system.

Let's contrast Drill with other Hadoop offerings, such as Hive. In Hive, a 
metadata system holds the type information for each table, eliminating all 
ambiguities. Every Hive-based application uses the same schema. The user or 
admin defines the schema once per table, not once per query.

Drill takes the other extreme: not only does Drill not require schema, Drill 
can't even use a schema unless the user uses the (much slower) Hive record 
readers.

The historical issues are that 1) Drill is advertised as schema free, 2) Drill 
tries to be independent of Hive, and 3) Hive does not support schema evolution 
-- something Drill claims to do.

In any case, IMHO, the proper solution to schema change errors is to avoid 
them: allow the user to specify a schema for each table, including rules to 
handle schema evolution (if column "x" does not appear in older files, assume 
it is a blank Varchar, say.)
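
As a rough sketch of what such a per-table schema could look like (purely
illustrative; TABLE_SCHEMA and apply_schema are hypothetical names, and this
is not a proposal for Drill's actual syntax or API), assume each column
declares a converter plus a default to use when older files lack the column:

```python
# Hypothetical declared schema: column -> (converter, default if column missing).
TABLE_SCHEMA = {
    "id": (int, None),
    "x": (str, ""),   # absent in older files: assume a blank Varchar
}


def apply_schema(record, schema):
    """Coerce one raw record (dict of strings) to the declared schema."""
    row = {}
    for name, (convert, default) in schema.items():
        if name in record:
            row[name] = convert(record[name])
        else:
            row[name] = default   # schema-evolution rule for a missing column
    return row


old_file_record = {"id": "1"}                # written before "x" existed
new_file_record = {"id": "2", "x": "hello"}
rows = [apply_schema(r, TABLE_SCHEMA) for r in (old_file_record, new_file_record)]
print(rows)
```

With the rule declared once per table, both old and new files produce rows of
the same shape, so no downstream operator ever sees a schema change.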

Once Drill has a way to use a schema, then a separate topic is how Drill 
obtains the schema. The user could provide it. It can be imported from Hive. 
Drill can scan all the files and infer a schema. However, all these are 
secondary considerations. The core issue is that the schema is required to 
avoid schema ambiguity. The information is not to be found through clever 
code changes; the information does not exist in the files. There are many 
possible interpretations of, say, a text field in a CSV file. So, the actual 
mapping is only known to the creator of the file; we need a way to convey that 
intent into Drill.
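
A small illustration of the CSV point, assuming a made-up field value (the
value and the type labels are hypothetical, chosen only for the example):

```python
from datetime import datetime

raw = "20180328"   # one text field as it might appear in a CSV file

# Three readings of the same bytes, all equally valid from the file's view.
interpretations = {
    "VARCHAR": raw,
    "INT": int(raw),
    "DATE": datetime.strptime(raw, "%Y%m%d").date(),
}

for type_name, value in interpretations.items():
    print(type_name, repr(value))
```

Every reading parses cleanly, so no amount of inspection can tell which one
the file's creator meant.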


> Define the Schema Change support functionality
> ----------------------------------------------
>
>                 Key: DRILL-6297
>                 URL: https://issues.apache.org/jira/browse/DRILL-6297
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>
> The schema change support functionality is one of the main functional aspects 
> of Drill; unfortunately, there is no formal technical specification to this 
> key functionality which makes it very hard for:
>  * The Drill users to figure out what is the extent of schema changes support 
> and when it is safe to use it
>  * Development to support this functionality
>  
> Goal -
>  * The goal of this Jira is to deliver a functional specification for the 
> schema change functionality
>  * I'll create a strawman proposal based on previous input and hopefully 
> start a discussion to gradually refine it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
