[
https://issues.apache.org/jira/browse/DRILL-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676121#comment-16676121
]
Paul Rogers commented on DRILL-6829:
------------------------------------
I did a bit of work in external sort and can comment on schema changes. My
advice is: don't try to handle it in sort; handle it during read with the
upcoming schema feature.
Let's say we have a number of batches queued up. When a new schema appears, we
must rewrite the existing batches. If those batches are on disk (due to
spilling), we must either a) read them from disk, modify the vectors, and
rewrite them, or b) keep a tally of which batches need which updates on reread.
Suppose we have three schema changes. We'd need each batch to recall its own
schema, plus the rules to apply to rewrite the data on reread. Perhaps some
batches start as Varchar, later we decide to change to Int, and still later to
Float. Each batch will need a different set of rules.
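To make that bookkeeping concrete, here is a rough sketch of the kind of
per-batch conversion log the sort would have to maintain. The names
(SpilledBatchFixups, ColumnRule) are purely illustrative, not actual Drill
classes:
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: the bookkeeping a sort would need if it tried to
// reconcile schema changes itself across spilled batches.
public class SpilledBatchFixups {

  /** One conversion to apply to a column when a spilled batch is reread. */
  static final class ColumnRule {
    final String column;
    final String fromType;   // e.g. "VARCHAR"
    final String toType;     // e.g. "INT", then later "FLOAT8"
    ColumnRule(String column, String fromType, String toType) {
      this.column = column;
      this.fromType = fromType;
      this.toType = toType;
    }
  }

  // Each spilled batch carries the schema it was written with, plus the
  // ordered list of rewrites discovered *after* it was spilled.
  private final Map<Integer, List<ColumnRule>> rulesPerBatch = new HashMap<>();

  public void recordSchemaChange(int batchId, String column,
                                 String fromType, String toType) {
    rulesPerBatch.computeIfAbsent(batchId, id -> new ArrayList<>())
                 .add(new ColumnRule(column, fromType, toType));
  }

  public List<ColumnRule> rulesFor(int batchId) {
    return rulesPerBatch.getOrDefault(batchId, List.of());
  }
}
{code}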
Memory allocation in the sort is already very complex. Now we must add
calculations to determine the new vectors we need, and account for both the new
and the original vector sizes during the window in which we read a batch from
disk and rewrite it.
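A back-of-the-envelope illustration (the row count and value widths below are
assumed, not measured):
{code:java}
// While a spilled batch is converted, the sort briefly holds both the original
// and the rewritten vector, so it must budget for the sum of the two sizes.
public class RewriteMemoryBudget {
  public static void main(String[] args) {
    long rows = 64_000;
    long originalBytes = rows * 4;      // column read as INT (4 bytes/value)
    long convertedBytes = rows * 8;     // same rows rewritten as FLOAT8
    long peakDuringRewrite = originalBytes + convertedBytes;
    System.out.println("peak bytes while rewriting one batch: "
        + peakDuringRewrite);
  }
}
{code}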
When all this is done, there will still be conflicts. If we have both Varchar
and Int data, should we convert both to Int (perhaps they are numbers) or to
Varchar (perhaps they are product codes)? Suppose some values are Varchar (the
string form of a date) and some are DateTime. How would we know which format to
use to parse the strings into dates, or to format the dates into strings?
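A tiny example of why the sort has no basis to choose: the same Varchar value
parses to two different dates depending on the format we guess (plain
java.time, nothing Drill-specific):
{code:java}
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateGuess {
  public static void main(String[] args) {
    String raw = "01/02/2019";
    // Is this January 2 or February 1? Both parses are "valid".
    LocalDate us = LocalDate.parse(raw, DateTimeFormatter.ofPattern("MM/dd/yyyy"));
    LocalDate eu = LocalDate.parse(raw, DateTimeFormatter.ofPattern("dd/MM/yyyy"));
    System.out.println(us + " vs " + eu);  // 2019-01-02 vs 2019-02-01
  }
}
{code}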
Now, imagine replicating this work in the other buffering operators: hash agg,
hash join, etc.
Each of these operators uses code generation to speed up operations. We'd need
to generate code for every possible combination: the incoming type is Int,
convert to Varchar and compare; the incoming type is Varchar, convert to Float
and compare; the incoming type is BigInt, convert to Double and compare, and so
on. Imagine a big n x n matrix of type conversions (everything except the
diagonal) that must be code generated -- for each such operator. That is a huge
amount of complexity to develop and test.
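A rough way to see the size of that matrix, using only a small sample of types
(not the full Drill type system):
{code:java}
public class ConversionMatrix {
  public static void main(String[] args) {
    String[] types = {"INT", "BIGINT", "FLOAT4", "FLOAT8", "VARCHAR", "DATE"};
    int combos = 0;
    for (String from : types) {
      for (String to : types) {
        if (!from.equals(to)) {
          // Each ordered pair would need its own generated
          // convert-then-compare routine.
          System.out.println("generate: compare(" + from + " -> " + to + ")");
          combos++;
        }
      }
    }
    // Six types already require 30 conversion variants -- per operator.
    System.out.println("total generated variants: " + combos);
  }
}
{code}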
And what would we do in non-buffering operators? A hash agg group by in which
different groups end up with different types?
We'd also need to merge types in receivers. Node A reads a column as Int, Node
B as Varchar. Both nodes send to Node C. The exchange must reconcile these
types, else the next operator up will see a stream of schema change events
from, say, Int to Varchar back to Int, back to Varchar.
In short, we really don't want to make these decisions inside an operator such
as sort. Much better to do them in the reader (or scanner) in a unified way.
Finally, let's say there is no buffering operator. We read 100 million rows
with column C as Int. Then we read another 100 million with column C as
Varchar. We've already sent the first 100 million rows to the JDBC client. How
will that client handle the Varchar? It can't go back in time and pick Varchar
as the type for all C values. The client will receive a schema change event,
and it won't be able to handle it.
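For reference, this is how the client side looks through plain JDBC (the
connection URL and query below are placeholders): ResultSetMetaData reports one
type per column before the rows stream, and there is no mechanism to announce a
new type for column C partway through the result set.
{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class JdbcClientView {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT c FROM dfs.`data`")) {
      ResultSetMetaData md = rs.getMetaData();
      // The type of column c is fixed here, before any rows are read.
      System.out.println("column c reported as: " + md.getColumnTypeName(1));
      while (rs.next()) {
        // Every value of c is read under that single reported type.
        System.out.println(rs.getObject(1));
      }
    }
  }
}
{code}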
But insert a sort and, voila, the client sees no schema change. That
inconsistency will be very hard to use, explain, and support.
Could all this be made to work? Perhaps. Even so, it overlooks more complex
issues, such as a JSON column that morphs from a scalar to an object, or a JSON
array that starts as all nulls and later turns out to be Varchar. The reader
cannot even read this kind of data, so it does little good for the sort (say)
to try to handle the problem; the sort is too late.
It is much better to tackle this at the source of the problem: the scan
operator. Once Drill can work with a schema, the query states up front the type
to use for each column, the scanner converts incoming data to that type, and no
other operator ever encounters a schema change.
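A minimal sketch of the "convert at the scan" idea, assuming an up-front schema
that declares column c as Float8. The class and method names are illustrative,
not the actual schema framework API:
{code:java}
public class ScanTimeConverter {

  private final String declaredType = "FLOAT8";  // from the up-front schema

  /** Convert one raw value read from the source into the declared type. */
  public double convert(Object raw) {
    if (raw instanceof Number) {
      return ((Number) raw).doubleValue();       // INT, BIGINT, FLOAT4 widen
    }
    if (raw instanceof String) {
      return Double.parseDouble((String) raw);   // VARCHAR parsed per schema
    }
    throw new IllegalArgumentException(
        "cannot convert " + raw.getClass() + " to " + declaredType);
  }
}
{code}
Because the conversion happens once, in the reader, the sort, hash agg, hash
join, exchanges, and the JDBC client all see one stable type for the column.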
The rule is: with a schema, we have a definition that applies to all columns at
all times during a query. There is no good way to know, at the start of a
query, which types will appear later. The only solid, workable, tried-and-true
solution is to require a schema up front so that all operators play by the same
rules.
> Handle schema change in ExternalSort
> ------------------------------------
>
> Key: DRILL-6829
> URL: https://issues.apache.org/jira/browse/DRILL-6829
> Project: Apache Drill
> Issue Type: New Feature
> Reporter: Aman Sinha
> Priority: Major
>
> While we continue to enhance the schema provision and metastore aspects in
> Drill, we also should explore what it means to be truly schema-less such that
> we can better handle {semi, un}structured data, data sitting in DBs that
> store JSON documents (e.g., Mongo, MapR-DB).
>
> The blocking operators are the main hurdles in this goal (other operators
> also need to be smarter about this but the problem is harder for the blocking
> operators). This Jira is specifically about ExternalSort.
>
>