[ 
https://issues.apache.org/jira/browse/DRILL-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943737#comment-15943737
 ] 

Paul Rogers commented on DRILL-5384:
------------------------------------

Hi [~jni]. I may be confusing two cases. Let's discuss arrays first. When a 
test case reads a text file with three columns, and uses those three columns as 
sort keys, I see an incoming row with six columns, three of which are from the 
project. Are you saying that the three projected copies are simply references 
to the three original columns, not copies? The code shows that we do, in fact, 
make a copy. The new "RecordBatchSizer" shows that the combined size of all six 
columns equals the change in the memory allocator's "memory allocated" value. If these are 
not copies, then the allocator is somehow being fooled into thinking they 
consume memory.

Now, back to the map. My assumption (which you suggest is wrong) is that map 
projects work the same way. I've not looked at that particular bit of code in 
detail, so I can't comment yet one way or the other.

For a design showing how the sort might handle complex paths directly, look at the 
PR for the "RowSet" test tools. These provide a flattened schema that presents 
map columns as top-level columns (with dotted names) for ease of setting up and 
validating test cases. The thought is that the same approach used in that test 
code could be applied to the map code. Not a priority at the moment, but 
something to keep in mind when we want to minimize memory use and optimize 
performance.
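
To make the flattened-schema idea concrete, here is a minimal sketch of the concept only. It does not use the actual "RowSet" classes from the PR; the class and method names (FlattenedSchemaSketch, flatten, sampleRow) are hypothetical, and plain Java maps stand in for Drill's map vectors. It simply shows nested members being exposed as top-level entries with dotted names:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT the RowSet API from the PR.
// Plain Java maps stand in for Drill map vectors.
class FlattenedSchemaSketch {

    // Walks a nested map and records each scalar leaf under its dotted path.
    static void flatten(String prefix, Map<String, Object> node, Map<String, Object> out) {
        for (Map.Entry<String, Object> e : node.entrySet()) {
            String path = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            Object value = e.getValue();
            if (value instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) value;
                flatten(path, child, out);   // descend into the nested tuple
            } else {
                out.put(path, value);        // scalar: expose as a top-level name
            }
        }
    }

    // Convenience wrapper that returns the flattened view in insertion order.
    static Map<String, Object> flatten(Map<String, Object> row) {
        Map<String, Object> out = new LinkedHashMap<>();
        flatten("", row, out);
        return out;
    }

    // The "orders" example from the issue description, built as nested maps.
    static Map<String, Object> sampleRow() {
        Map<String, Object> customer = new LinkedHashMap<>();
        customer.put("id", 10);
        customer.put("name", "fred");
        Map<String, Object> order = new LinkedHashMap<>();
        order.put("id", 20);
        order.put("product", "Frammis 1000");
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("customer", customer);
        row.put("order", order);
        return row;
    }

    public static void main(String[] args) {
        System.out.println(flatten(sampleRow()).keySet());
    }
}
```

With the sample row, the flattened view has the keys customer.id, customer.name, order.id and order.product. Note that this toy version copies values into a new map for simplicity; the point of the real design would be to expose the existing vectors under those names without copying.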

> Sort cannot directly access map members, causes a data copy
> -----------------------------------------------------------
>
>                 Key: DRILL-5384
>                 URL: https://issues.apache.org/jira/browse/DRILL-5384
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Suppose we have a JSON structure for "orders" like this:
> {code}
> { customer: { id: 10, name: "fred" },
>   order: { id: 20, product: "Frammis 1000" } }
> {code}
> Suppose I want to sort by customer.id. Today, Drill will project customer.id 
> up to the top level as a temporary, hidden field. Drill will copy the data 
> from the customer.id vector to this new temporary field. Drill then sorts on 
> the temporary column, and uses another project to remove the columns.
> Clearly, this works, but it has a cost:
> * Two extra project operators.
> * Extra memory copy.
> * Sort must buffer both the original and copied data. This can double memory 
> use in the worst case.
> All of this is done simply to avoid having to reference "customer.id" in the 
> sort.
> But, as explained in DRILL-5376, maps are just nested tuples; there is no 
> need to copy the data, the data is already right there in a value vector. The 
> problem is that Drill's map implementation makes it hard for the generated 
> code to get at the "customer.id" vector.
> This ticket asks to allow the sort to work directly with nested scalars to 
> avoid the overhead explained above. To do this:
> 1. Fix nested scalar access to allow the generated code to easily access a 
> nested scalar.
> 2. Allow a sort key of the form "customer.id".
> 3. Modify the planner to generate such sort keys instead of the dual projects.
> The result will be a leaner, faster sort operation when sorting on scalars 
> within a map.
>   
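
As a rough illustration of what the issue description asks for (sorting on "customer.id" without first copying it to a top-level temporary), the sketch below orders an index array by reading the nested column in place. This is an assumption-laden toy, not Drill's generated sort code: the names NestedSortSketch, CustomerMap and sortByCustomerId are hypothetical, and Java arrays stand in for value vectors.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch; arrays stand in for Drill value vectors.
class NestedSortSketch {

    // Stand-in for a map vector: a nested tuple whose members are columns.
    static final class CustomerMap {
        final int[] id;
        final String[] name;

        CustomerMap(int[] id, String[] name) {
            this.id = id;
            this.name = name;
        }
    }

    // Returns row indexes ordered by customer.id. The comparator reads the
    // nested id column directly, so no copy of the key data is made; only
    // the index array is allocated.
    static Integer[] sortByCustomerId(CustomerMap customer) {
        Integer[] order = new Integer[customer.id.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingInt(i -> customer.id[i]));
        return order;
    }

    public static void main(String[] args) {
        CustomerMap customer = new CustomerMap(
            new int[] {30, 10, 20},
            new String[] {"wilma", "fred", "barney"});
        // Prints the row order after sorting on the nested key.
        System.out.println(Arrays.toString(sortByCustomerId(customer)));
    }
}
```

For the three sample rows the resulting order is [1, 2, 0]. The original columns are left untouched, which is the memory saving the ticket is after.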



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
