[ 
https://issues.apache.org/jira/browse/DRILL-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677616#comment-16677616
 ] 

Paul Rogers commented on DRILL-6829:
------------------------------------

[~amansinha100], actually, given that this change is meant to support a 
MongoDB-like back-end with a Drill client as the front end, there is a 
different set of solutions that is not constrained by the needs of xDBC users.

The external sort already supports schema change by transforming columns into a 
Union type. That transform is not memory managed because it did not seem worth 
it at the time. But, I seem to recall creating (or using) tests that did verify 
that the union support worked for non-key columns.

For a sort, the client will receive a union of all possible types since the 
sort sees all rows.
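
As a rough illustration of what "a union of all possible types" means here, 
the sketch below collects, per column, every type seen across the batches a 
sort has buffered. It is plain Python rather than Drill code; the function 
name, the dict-based schema representation, and the type names are 
assumptions made only for this example.

```python
def union_schema(batch_schemas):
    """Collect, per column, every type seen across all buffered batches.

    Each schema is a dict of column name -> type name (a hypothetical,
    simplified stand-in for a real batch schema).
    """
    seen = {}
    for schema in batch_schemas:
        for column, col_type in schema.items():
            seen.setdefault(column, set()).add(col_type)
    # A column with more than one entry is one that would become a union.
    return {column: sorted(types) for column, types in seen.items()}


# Two batches whose "price" column changed type part-way through the scan.
batches = [
    {"id": "BIGINT", "price": "FLOAT8"},
    {"id": "BIGINT", "price": "VARCHAR"},
]
schema = union_schema(batches)
union_columns = [c for c, types in schema.items() if len(types) > 1]
```

Because the sort buffers every row before emitting any, it can compute this 
once and present a single, stable schema downstream.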

The receiver will need similar treatment. I seem to recall that we observed 
that it has to rebuild a schema on each schema change; it might as well be the 
thing that builds the unions on the second schema change.

Of course, the sort must still handle its own schema change since it will 
already have buffered batches with the first schema.

This solution solves the "time travel" issue because sort is unique in that it 
sees all rows (or, at least, all rows in a fragment.)

In the worst case, each sort instance might still see a different schema, and 
the merging receiver would need to merge them. A solution would be to merge all 
sort schemas in the merging receiver before delivering rows, converting 
incoming batches as needed prior to doing the final merge. (Or, if you are very 
brave, do the conversion row-by-row during the merge.)
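
The merge-then-convert approach described above could look roughly like the 
following. Again this is plain Python, not Drill's API: the helper names, the 
dict-based rows, and the (type tag, value) encoding of a union member are all 
hypothetical, and the sort key is assumed to have the same type in every 
fragment.

```python
import heapq


def merge_schemas(fragment_schemas):
    """Merge per-fragment schemas (column -> set of type names) into one."""
    merged = {}
    for schema in fragment_schemas:
        for column, types in schema.items():
            merged.setdefault(column, set()).update(types)
    return merged


def convert_batch(rows, merged):
    """Rewrite rows to the merged schema before the final merge.

    Values in columns that ended up with several types are wrapped as a
    (type tag, value) pair, standing in for a union member.
    """
    converted = []
    for row in rows:
        out = {}
        for column, types in merged.items():
            value = row.get(column)
            out[column] = (type(value).__name__, value) if len(types) > 1 else value
        converted.append(out)
    return converted


# Each fragment's sort output is already ordered by "id".
frag_a = ({"id": {"int"}, "price": {"float"}},
          [{"id": 1, "price": 9.5}, {"id": 3, "price": 2.0}])
frag_b = ({"id": {"int"}, "price": {"str"}},
          [{"id": 2, "price": "n/a"}])

merged = merge_schemas([frag_a[0], frag_b[0]])
streams = [convert_batch(rows, merged) for _, rows in (frag_a, frag_b)]
result = list(heapq.merge(*streams, key=lambda r: r["id"]))
```

Converting whole batches up front keeps the merge loop itself unaware of 
schema differences, at the cost of one extra pass over each incoming batch.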

Note that a different solution would be needed for joins, aggregations, etc. 
Simple queries without such an operator to "smooth over" schema changes would 
simply deliver varying schemas to the Drill client.

And, just to be clear, the above works for MongoDB uses with Drill clients; not 
for JSON/Parquet/CSV users with xDBC clients.

I wonder what the MongoDB users will do with schemas that vary across queries, 
even for the same row (sometimes delivered as Double, say, other times 
delivered as a union with Double as a member...)

> Handle schema change in ExternalSort
> ------------------------------------
>
>                 Key: DRILL-6829
>                 URL: https://issues.apache.org/jira/browse/DRILL-6829
>             Project: Apache Drill
>          Issue Type: New Feature
>            Reporter: Aman Sinha
>            Priority: Major
>
> While we continue to enhance the schema provision and metastore aspects in 
> Drill, we also should explore what it means to be truly schema-less such that 
> we can better handle {semi, un}structured data, data sitting in DBs that 
> store JSON documents (e.g., Mongo, MapR-DB). 
>  
> The blocking operators are the main hurdles in this goal (other operators 
> also need to be smarter about this but the problem is harder for the blocking 
> operators).   This Jira is specifically about ExternalSort. 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
