[
https://issues.apache.org/jira/browse/DRILL-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676121#comment-16676121
]
Paul Rogers commented on DRILL-6829:
------------------------------------
I did a bit of work in external sort and can comment on schema changes. My
advice is: don't try to handle it in sort; handle it during read with the
upcoming schema feature.
Let's say we have a number of batches queued up. When a new schema appears, we
must rewrite the existing batches. If those batches are on disk (due to
spilling), we must either a) read them from disk, modify the vectors, and
rewrite them, or b) keep a tally of which batches need which updates on reread.
Suppose we have three schema changes. We'd need each batch to recall its own
schema, plus the rules to apply to rewrite the data on reread. Perhaps some
batches start as Varchar, later we decide to change to Int, and still later to
Float. Each batch will need a different set of rules.
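To make that bookkeeping concrete, here is a rough sketch of the kind of
per-batch conversion log the sort would have to maintain. The names
(SpilledBatchFixups, ColumnRule) are purely illustrative, not actual Drill
classes:
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: the bookkeeping a sort would need if it tried to
// reconcile schema changes itself across spilled batches.
public class SpilledBatchFixups {

  /** One conversion to apply to a column when a spilled batch is reread. */
  static final class ColumnRule {
    final String column;
    final String fromType;   // e.g. "VARCHAR"
    final String toType;     // e.g. "INT", then later "FLOAT8"
    ColumnRule(String column, String fromType, String toType) {
      this.column = column;
      this.fromType = fromType;
      this.toType = toType;
    }
  }

  // Each spilled batch carries the schema it was written with, plus the
  // ordered list of rewrites discovered *after* it was spilled.
  private final Map<Integer, List<ColumnRule>> rulesPerBatch = new HashMap<>();

  public void recordSchemaChange(int batchId, String column,
                                 String fromType, String toType) {
    rulesPerBatch.computeIfAbsent(batchId, id -> new ArrayList<>())
                 .add(new ColumnRule(column, fromType, toType));
  }

  public List<ColumnRule> rulesFor(int batchId) {
    return rulesPerBatch.getOrDefault(batchId, List.of());
  }
}
{code}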
Memory allocation in the sort is already very complex. Now we must add
calculations to determine the new vectors we need, and account for both the new
and the original vector sizes during the window in which we read a batch from
disk and rewrite it.
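A back-of-the-envelope illustration (the row count and value widths below are
assumed, not measured):
{code:java}
// While a spilled batch is converted, the sort briefly holds both the original
// and the rewritten vector, so it must budget for the sum of the two sizes.
public class RewriteMemoryBudget {
  public static void main(String[] args) {
    long rows = 64_000;
    long originalBytes = rows * 4;      // column read as INT (4 bytes/value)
    long convertedBytes = rows * 8;     // same rows rewritten as FLOAT8
    long peakDuringRewrite = originalBytes + convertedBytes;
    System.out.println("peak bytes while rewriting one batch: "
        + peakDuringRewrite);
  }
}
{code}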
When all this is done, there will still be conflicts. If we have both Varchar
and Int data, should we convert both to Int (perhaps they are numbers) or to
Varchar (perhaps they are product codes)? Suppose some values are Varchar (the
string form of a date) and some are DateTime. How would we know which format to
use to parse the strings into dates, or to format the dates into strings?
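A tiny example of why the sort has no basis to choose: the same Varchar value
parses to two different dates depending on the format we guess (plain
java.time, nothing Drill-specific):
{code:java}
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateGuess {
  public static void main(String[] args) {
    String raw = "01/02/2019";
    // Is this January 2 or February 1? Both parses are "valid".
    LocalDate us = LocalDate.parse(raw, DateTimeFormatter.ofPattern("MM/dd/yyyy"));
    LocalDate eu = LocalDate.parse(raw, DateTimeFormatter.ofPattern("dd/MM/yyyy"));
    System.out.println(us + " vs " + eu);  // 2019-01-02 vs 2019-02-01
  }
}
{code}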
Now, imagine replicating this work in the other buffering operators: hash agg,
hash join, etc.
Each of these operators uses code generation to speed up operations. We'd need
to generate code for every possible combination: the incoming type is Int,
convert to Varchar and compare; the incoming type is Varchar, convert to Float
and compare; the incoming type is BigInt, convert to Double and compare, and so
on. Imagine a big n x n matrix of type conversions (everything except the
diagonal) that must be code generated -- for each such operator. That is a huge
amount of complexity to develop and test.
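A rough way to see the size of that matrix, using only a small sample of types
(not the full Drill type system):
{code:java}
public class ConversionMatrix {
  public static void main(String[] args) {
    String[] types = {"INT", "BIGINT", "FLOAT4", "FLOAT8", "VARCHAR", "DATE"};
    int combos = 0;
    for (String from : types) {
      for (String to : types) {
        if (!from.equals(to)) {
          // Each ordered pair would need its own generated
          // convert-then-compare routine.
          System.out.println("generate: compare(" + from + " -> " + to + ")");
          combos++;
        }
      }
    }
    // Six types already require 30 conversion variants -- per operator.
    System.out.println("total generated variants: " + combos);
  }
}
{code}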
And what would we do in non-buffering operators? A hash agg group by in which
different groups end up with different types?
We'd also need to merge types in receivers. Node A reads a column as Int, Node
B as Varchar. Both nodes send to Node C. The exchange must reconcile these
types, else the next operator up will see a stream of schema change events
from, say, Int to Varchar back to Int, back to Varchar.
In short, we really don't want to make these decisions inside an operator such
as sort. Much better to do them in the reader (or scanner) in a unified way.
Finally, let's say there is no buffering operator. We read 100 million rows
with column C as Int. Then we read another 100 million with column C as
Varchar. We've already sent the first 100 million rows to the JDBC client. How
will that client handle the Varchar? It can't go back in time and pick Varchar
as the type for all C values. The client will receive a schema change event,
and it won't be able to handle it.
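For reference, this is how the client side looks through plain JDBC (the
connection URL and query below are placeholders): ResultSetMetaData reports one
type per column before the rows stream, and there is no mechanism to announce a
new type for column C partway through the result set.
{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class JdbcClientView {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT c FROM dfs.`data`")) {
      ResultSetMetaData md = rs.getMetaData();
      // The type of column c is fixed here, before any rows are read.
      System.out.println("column c reported as: " + md.getColumnTypeName(1));
      while (rs.next()) {
        // Every value of c is read under that single reported type.
        System.out.println(rs.getObject(1));
      }
    }
  }
}
{code}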
But insert a sort and, voila, the client sees no schema change. That
inconsistency will be very hard to use, explain, and support.
Could all this be made to work? Perhaps. Even so, it overlooks more complex
issues, such as a JSON column that morphs from a scalar to an object, or a JSON
array that starts as all nulls and later turns out to be Varchar. The reader
cannot even read this kind of data, so it does little good for the sort (say)
to try to handle the problem; the sort is too late.
It is much better to tackle this at the source of the problem: the scan
operator. Once Drill can work with a schema, the query states up front the type
to use for each column, the scanner converts incoming data to that type, and no
other operator ever encounters a schema change.
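A minimal sketch of the "convert at the scan" idea, assuming an up-front schema
that declares column c as Float8. The class and method names are illustrative,
not the actual schema framework API:
{code:java}
public class ScanTimeConverter {

  private final String declaredType = "FLOAT8";  // from the up-front schema

  /** Convert one raw value read from the source into the declared type. */
  public double convert(Object raw) {
    if (raw instanceof Number) {
      return ((Number) raw).doubleValue();       // INT, BIGINT, FLOAT4 widen
    }
    if (raw instanceof String) {
      return Double.parseDouble((String) raw);   // VARCHAR parsed per schema
    }
    throw new IllegalArgumentException(
        "cannot convert " + raw.getClass() + " to " + declaredType);
  }
}
{code}
Because the conversion happens once, in the reader, the sort, hash agg, hash
join, exchanges, and the JDBC client all see one stable type for the column.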
The rule is: with a schema, we have a definition that applies to all columns at
all times during a query. There is no good way to know, at the start of a
query, which types will appear later. The only solid, workable, tried-and-true
solution is to require a schema up front so that all operators play by the same
rules.
> Handle schema change in ExternalSort
> ------------------------------------
>
> Key: DRILL-6829
> URL: https://issues.apache.org/jira/browse/DRILL-6829
> Project: Apache Drill
> Issue Type: New Feature
> Reporter: Aman Sinha
> Priority: Major
>
> While we continue to enhance the schema provision and metastore aspects in
> Drill, we also should explore what it means to be truly schema-less such that
> we can better handle {semi, un}structured data, data sitting in DBs that
> store JSON documents (e.g., Mongo, MapR-DB).
>
> The blocking operators are the main hurdles in this goal (other operators
> also need to be smarter about this but the problem is harder for the blocking
> operators). This Jira is specifically about ExternalSort.
>
>