Paul Rogers created DRILL-5366:
----------------------------------

             Summary: Use generic copier for wide rows in external sort
                 Key: DRILL-5366
                 URL: https://issues.apache.org/jira/browse/DRILL-5366
             Project: Apache Drill
          Issue Type: Sub-task
    Affects Versions: 1.10.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers
             Fix For: 1.11.0


The external sort makes use of a "priority copier" to copy rows at two times:

* When merging data during spilling
* When merging data for an in-memory sort

As with all such Drill operators, the code works by generating two local 
variables per column, then generating two blocks of code per column (one for 
setup, one for the actual copy.)

This works fine for rows with few columns. But, in rows with many columns (such 
as for queries against JSON documents), the amount of code produced becomes 
very large. This introduces extra overhead to generate, compile and store the 
extra code.

DRILL-5125 found the same issue in the Selection Vector remover. By applying 
the fix from that ticket to the external sort, we reap 24% savings.

Consider a unit test that runs only the copier. Create a single record batch 
with 1000 columns and 64K rows. Use the copier to produce a set of smaller 
output batches. Such a test factors out all the overhead of running a query.

* Run time for a generated copier: 17 secs.
* Run time with the generic copier: 13 secs
* Savings: 4 seconds or 24%.

(13 seconds is still a very long time to process 64K rows. There may be 
optimizations to be had in the priority queue implementation as well, but that 
is a separate issue.)

To be conservative, provide a config option to enable the feature, perhaps by 
setting a threshold of the number of columns that must be present to use the 
generic version. That way, if folks feel that the generated version is faster 
for narrow rows, the generated version can be used. And each user can decide 
the point at which the costs of bulky code outweighs the performance costs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to