[
https://issues.apache.org/jira/browse/DRILL-6297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417918#comment-16417918
]
Paul Rogers commented on DRILL-6297:
------------------------------------
Have not seen the proposal, obviously. To save another rehash of the same old
issues, it is perhaps worth pointing out that there are many Jira tickets on
this issue that should be reviewed. In particular, see DRILL-6035 as one of
many examples.
Drill has long embraced the idea of a schema free system: input files need not
agree on the set or type of columns, or on column names. The implicit idea
(never actually documented or designed) is that, somehow, Drill operators will
"do the right thing" to figure out that, say, columns "foo" and "fooBar" are
two versions of the same column, that "Varchar" and "Int" types for the same
column can be reconciled, or that even Int and Repeated Int can somehow be
merged.
Add to that the idea that a missing column is typed as Nullable Int, when it
might be typed as Map where it actually appears, and we have a
nearly-intractable problem.
At the core is that there is no right answer: the problem is inherently
ambiguous. If I have Int and Varchar, should the Int be converted to Varchar or
vice versa? How do I know the type of a column I've never seen?
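To make the ambiguity concrete, here is a minimal sketch in plain Python (not
Drill code). The two policy functions and the sample column are hypothetical,
invented only for illustration. Both policies are defensible, yet they give
different answers to the same ORDER BY.

```python
from typing import List, Optional, Union

Value = Union[int, str]


def widen_to_varchar(values: List[Value]) -> List[str]:
    """Policy A (hypothetical): treat the column as Varchar; Ints become text."""
    return [str(v) for v in values]


def narrow_to_int(values: List[Value]) -> List[Optional[int]]:
    """Policy B (hypothetical): treat the column as Int; unparsable text becomes NULL."""
    out: List[Optional[int]] = []
    for v in values:
        try:
            out.append(int(v))
        except ValueError:
            out.append(None)  # stand-in for SQL NULL
    return out


# One file delivered the column as Int, another as Varchar.
column: List[Value] = [10, 9, "100", "n/a"]

as_varchar = widen_to_varchar(column)
as_int = narrow_to_int(column)

# The same "ORDER BY" produces different results under each policy.
sorted_varchar = sorted(as_varchar)                       # lexical order
sorted_int = sorted(v for v in as_int if v is not None)   # numeric order, NULL dropped

print(sorted_varchar)
print(sorted_int)
```

Policy A keeps every value but sorts lexically; Policy B sorts numerically but
silently loses "n/a". Nothing in the data says which one the user wanted.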
It is not clear that this chaos actually provides any value whatsoever to the
user. Suppose we decide on rules that don't fit the application. Now the user
has to work around seemingly-arbitrary rules.
At the same time, Drill is SQL-based. SQL gets its power from working with
relations over collections of domains. (Tables of columns with known, fixed
types.) It is not clear that standard SQL can be extended to deal with columns
of varying types, or tables with varying columns. (There is an extension, SQL++,
that proposes to do some of this; but Drill's claim to fame is that it supports
standard SQL of the type produced by Tableau, etc.)
Drill does provide a partial solution: use lots of conditional statements and
casts in each query to force the type conversions. These are often moved into
views. Still, this is, at best, a poor man's metadata system.
Let's contrast Drill with other Hadoop offerings, such as Hive. In Hive, a
metadata system holds the type information for each table, eliminating all
ambiguities. Every Hive-based application uses the same schema. The user or
admin defines the schema once per table, not once per query.
Drill takes the other extreme: not only does Drill not require schema, Drill
can't even use a schema unless the user uses the (much slower) Hive record
readers.
The historical issues are that 1) Drill is advertised as schema free, 2) Drill
tries to be independent of Hive, and 3) Hive does not support schema evolution
-- something Drill claims to do.
In any case, IMHO, the proper solution to schema change errors is to avoid
them: allow the user to specify a schema for each table, including rules to
handle schema evolution (if column "x" does not appear in older files, assume
it is a blank Varchar, say.)
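As a rough sketch of what a per-table schema with evolution rules could look
like, consider the following plain Python. This is not a Drill API: ColumnSpec,
TABLE_SCHEMA and apply_schema are hypothetical names, and the "id" column is
invented for the example. Only the rule for column "x" comes from the text
above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class ColumnSpec:
    """One declared column: its name, a converter to the declared type,
    and the value to assume when a file does not contain the column."""
    name: str
    convert: Callable[[Any], Any]
    default: Any


# Hypothetical schema, declared once per table rather than once per query.
TABLE_SCHEMA: List[ColumnSpec] = [
    ColumnSpec("id", int, None),
    # Evolution rule: if "x" does not appear in older files, assume blank Varchar.
    ColumnSpec("x", str, ""),
]


def apply_schema(record: Dict[str, Any], schema: List[ColumnSpec]) -> Dict[str, Any]:
    """Project a raw record onto the declared schema, converting present
    values and filling in the declared default for missing ones."""
    row: Dict[str, Any] = {}
    for col in schema:
        value = record.get(col.name)
        row[col.name] = col.default if value is None else col.convert(value)
    return row


old_file = [{"id": "1"}]              # written before column "x" existed
new_file = [{"id": 2, "x": "hello"}]  # written after

rows = [apply_schema(r, TABLE_SCHEMA) for r in old_file + new_file]
print(rows)
```

With the schema in hand, both files yield rows of the same shape and type, so
no downstream operator ever sees a schema change.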
Once Drill has a way to use a schema, then a separate topic is how Drill
obtains the schema. The user could provide it. It can be imported from Hive.
Drill can scan all the files and infer a schema. However, all these are
secondary considerations. The core issue is that the schema is required to
avoid schema ambiguity. The information is not to be found through clever
code changes; the information does not exist in the files. There are many
possible interpretations of, say, a text field in a CSV file. So, the actual
mapping is only known to the creator of the file; we need a way to convey that
intent into Drill.
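A small illustration of the CSV point, in plain Python. The sample value is
hypothetical, chosen because it parses cleanly under several conventions. The
bytes in the file are identical in every case; only the reader's assumption
differs.

```python
from datetime import date, datetime

# A single text field as it might appear in a CSV file (hypothetical value).
raw = "01/02/03"

# Four readings of the same bytes, each valid under some convention.
interpretations = {
    "varchar": raw,
    "date_mdy": datetime.strptime(raw, "%m/%d/%y").date(),
    "date_dmy": datetime.strptime(raw, "%d/%m/%y").date(),
    "date_ymd": datetime.strptime(raw, "%y/%m/%d").date(),
}

for name, value in interpretations.items():
    print(f"{name}: {value!r}")
```

All three date readings succeed and all three differ, so no amount of
inspection of the file can pick the right one. Only the file's creator knows.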
> Define the Schema Change support functionality
> ----------------------------------------------
>
> Key: DRILL-6297
> URL: https://issues.apache.org/jira/browse/DRILL-6297
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: salim achouche
> Assignee: salim achouche
> Priority: Major
>
> The schema change support functionality is one of the main functional aspects
> of Drill; unfortunately, there is no formal technical specification to this
> key functionality, which makes it very hard for:
> * The Drill users to figure out what is the extent of schema changes support
> and when it is safe to use it
> * Development to support this functionality
>
> Goal -
> * The goal of this Jira is to deliver a functional specification for the
> schema change functionality
> * I'll create a strawman proposal based on previous input and hopefully
> start a discussion to gradually refine it