[ https://issues.apache.org/jira/browse/DRILL-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated DRILL-5366:
-------------------------------
      Priority: Minor  (was: Major)
    Issue Type: Improvement  (was: Bug)

> Use generic copier for wide rows in external sort
> -------------------------------------------------
>
>                 Key: DRILL-5366
>                 URL: https://issues.apache.org/jira/browse/DRILL-5366
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> The external sort uses a "priority copier" to copy rows in two cases:
> * When merging data during spilling
> * When merging data for an in-memory sort
> As with all such Drill operators, the code works by generating two local 
> variables per column, then generating two blocks of code per column (one for 
> setup, one for the actual copy).
> This works fine for rows with few columns. But for rows with many columns 
> (such as for queries against JSON documents), the amount of code produced 
> becomes very large. This introduces extra overhead to generate, compile and 
> store the extra code.
> DRILL-5125 found the same issue in the Selection Vector remover. By applying 
> the fix from that ticket to the external sort, we reap a 24% savings in run 
> time.
> Consider a unit test that runs only the copier. Create a single record batch 
> with 1000 columns and 64K rows. Use the copier to produce a set of smaller 
> output batches. Such a test factors out all the overhead of running a query.
> * Run time with the generated copier: 17 secs.
> * Run time with the generic copier: 13 secs.
> * Savings: 4 secs, or 24%.
> (13 seconds is still a very long time to process 64K rows. There may be 
> optimizations to be had in the priority queue implementation as well, but 
> that is a separate issue.)
> To be conservative, provide a config option to enable the feature, perhaps by 
> setting a threshold of the number of columns that must be present to use the 
> generic version. That way, if folks feel that the generated version is faster 
> for narrow rows, the generated version can be used. And each user can decide 
> the point at which the cost of bulky code outweighs the performance cost of 
> the generic copier.
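
A minimal Java sketch of the idea in the description above, offered as an
illustration only. Every name here (Column, RowCopier, GenericCopier,
CopierSelector, Batches) is hypothetical and is not a Drill class, and the
threshold rule (use the generic copier when the column count reaches the
threshold) is just one possible reading of the proposed config option.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a column of values; not Drill's vector API. */
final class Column {
    final int[] values;

    Column(int rowCount) {
        values = new int[rowCount];
    }
}

/** Hypothetical copier contract: copy one row from the input to the output. */
interface RowCopier {
    void copyRow(int inRow, int outRow);
}

/**
 * Generic copier: one loop over all columns. Code size stays constant no
 * matter how wide the row is, unlike generated code that emits variables
 * and copy blocks for each column.
 */
final class GenericCopier implements RowCopier {
    private final List<Column> in;
    private final List<Column> out;

    GenericCopier(List<Column> in, List<Column> out) {
        this.in = in;
        this.out = out;
    }

    @Override
    public void copyRow(int inRow, int outRow) {
        for (int c = 0; c < in.size(); c++) {
            out.get(c).values[outRow] = in.get(c).values[inRow];
        }
    }
}

/** Assumed threshold rule: wide rows use the generic copier. */
final class CopierSelector {
    static boolean useGenericCopier(int columnCount, int threshold) {
        return columnCount >= threshold;
    }
}

/** Helper that builds a batch of zero-filled columns. */
final class Batches {
    static List<Column> make(int columnCount, int rowCount) {
        List<Column> batch = new ArrayList<>();
        for (int c = 0; c < columnCount; c++) {
            batch.add(new Column(rowCount));
        }
        return batch;
    }
}

final class CopierSketchDemo {
    public static void main(String[] args) {
        int columnCount = 1000;
        int threshold = 100; // hypothetical config value
        List<Column> in = Batches.make(columnCount, 8);
        List<Column> out = Batches.make(columnCount, 8);
        if (CopierSelector.useGenericCopier(columnCount, threshold)) {
            new GenericCopier(in, out).copyRow(0, 0);
            System.out.println("Copied row with the generic copier");
        } else {
            System.out.println("Would generate a per-column copier here");
        }
    }
}
```

With a threshold like this, narrow rows would keep the generated copier and
only wide rows would take the generic path; the right cutoff would have to be
found by measurement.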



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
