[ 
https://issues.apache.org/jira/browse/DRILL-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943669#comment-15943669
 ] 

Jinfeng Ni edited comment on DRILL-5384 at 3/27/17 5:19 PM:
------------------------------------------------------------

Drill places a project for "customer.id" ahead of the sort operator because 
doing so greatly simplifies the sort code: the sort does not have to deal 
with a complex schema path. The overhead is small, because vector transfer 
happens at the batch level and is merely a reference transfer; no memory copy 
is involved.
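
As a rough illustration of what "reference transfer" means here, the following 
is a toy Python sketch, not Drill code. It assumes a batch can be modeled as a 
dict of column names, with plain lists standing in for value vectors; 
`project_nested` and the `$sort_key` alias are hypothetical names made up for 
this example.

```python
# Toy model only: Python lists stand in for Drill value vectors, and a dict
# stands in for a record batch. None of these names are Drill APIs.

def project_nested(batch, path, alias):
    """Expose the nested column at `path` under a top-level `alias`.

    Only a reference to the existing column object is stored; no values
    are copied.
    """
    column = batch
    for part in path.split("."):
        column = column[part]
    projected = dict(batch)      # shallow copy of the batch layout only
    projected[alias] = column    # reference transfer, not a data copy
    return projected


batch = {
    "customer": {"id": [10, 7, 42], "name": ["fred", "wilma", "barney"]},
    "order": {"id": [20, 21, 22]},
}
projected = project_nested(batch, "customer.id", "$sort_key")

# Both names refer to the very same underlying column object.
same_object = projected["$sort_key"] is batch["customer"]["id"]
```

In this model the projected batch gains a new top-level name, but the column 
values themselves exist only once in memory.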

It would be great if you can make the sort handle a complex schema path 
directly. However, I doubt that such a proposal brings a performance benefit, 
unless real performance measurements prove my suspicion wrong. 
 

 


> Sort cannot directly access map members, causes a data copy
> -----------------------------------------------------------
>
>                 Key: DRILL-5384
>                 URL: https://issues.apache.org/jira/browse/DRILL-5384
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Suppose we have a JSON structure for "orders" like this:
> {code}
> { customer: { id: 10, name: "fred" },
>   order: { id: 20, product: "Frammis 1000" } }
> {code}
> Suppose I want to sort by customer.id. Today, Drill will project customer.id 
> up to the top level as a temporary, hidden field. Drill will copy the data 
> from the customer.id vector to this new temporary field. Drill then sorts on 
> the temporary column, and uses another project to remove the columns.
> Clearly, this works, but it has a cost:
> * Two extra project operators.
> * An extra memory copy.
> * Sort must buffer both the original and copied data. This can double memory 
> use in the worst case.
> All of this is done simply to avoid having to reference "customer.id" in the 
> sort.
> But, as explained in DRILL-5376, maps are just nested tuples; there is no 
> need to copy the data, the data is already right there in a value vector. The 
> problem is that Drill's map implementation makes it hard for the generated 
> code to get at the "customer.id" vector.
> This ticket asks to allow the sort to work directly with nested scalars to 
> avoid the overhead explained above. To do this:
> 1. Fix nested scalar access to allow the generated code to easily access a 
> nested scalar.
> 2. Allow a sort key of the form "customer.id".
> 3. Modify the planner to generate such sort keys instead of the dual projects.
> The result will be a leaner, faster sort operation when sorting on scalars 
> within a map.
>   
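
The idea proposed in the ticket, sorting on a key of the form "customer.id" 
without first projecting it to a top-level column, can be sketched in a toy 
Python model. This is not Drill's generated code; plain lists stand in for 
value vectors, and `resolve` and `sort_indices` are hypothetical helpers 
introduced only for illustration. The sample rows beyond the one in the ticket 
are made up.

```python
# Toy model only: plain lists stand in for value vectors. Hypothetical names.

def resolve(batch, path):
    """Walk a dotted path such as "customer.id" to the nested column."""
    column = batch
    for part in path.split("."):
        column = column[part]
    return column


def sort_indices(batch, sort_key):
    """Return row indices ordered by the nested column, without copying it."""
    key_column = resolve(batch, sort_key)
    return sorted(range(len(key_column)), key=key_column.__getitem__)


batch = {
    "customer": {"id": [10, 7, 42], "name": ["fred", "wilma", "barney"]},
    "order": {"id": [20, 21, 22]},
}
order = sort_indices(batch, "customer.id")

# Apply the resulting row order to any other column of the batch.
sorted_names = [batch["customer"]["name"][i] for i in order]
```

Here the sort reads the nested column in place and produces only an index 
permutation, so no temporary top-level field is needed in this model. Whether 
the same approach pays off inside Drill would still depend on measurement.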



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
