[ https://issues.apache.org/jira/browse/DRILL-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers updated DRILL-5366:
-------------------------------
Issue Type: Bug (was: Sub-task)
Parent: (was: DRILL-5318)
> Use generic copier for wide rows in external sort
> -------------------------------------------------
>
> Key: DRILL-5366
> URL: https://issues.apache.org/jira/browse/DRILL-5366
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.10.0
> Reporter: Paul Rogers
>
> The external sort uses a "priority copier" to copy rows at two points:
> * When merging data during spilling
> * When merging data for an in-memory sort
> As with all such Drill operators, the code works by generating two local
> variables per column, then generating two blocks of code per column (one for
> setup, one for the actual copy).
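
To make the contrast concrete, here is a minimal sketch of the two shapes of code. It is not Drill's generated code: the class names are hypothetical, and plain int arrays stand in for Drill's typed value vectors (one array per column).

```java
// Hypothetical stand-ins for illustration; none of these names come from Drill.
final class CopierSketch {

    // Shape of a generated copier for a 2-column row: every column gets its
    // own pair of variables, its own setup lines and its own copy statement,
    // so the amount of code grows with the number of columns.
    static final class GeneratedTwoColumnCopier {
        private int[] in0, out0;   // two variables for column 0
        private int[] in1, out1;   // two variables for column 1

        void setup(int[][] incoming, int[][] outgoing) {
            in0 = incoming[0]; out0 = outgoing[0];   // setup block, column 0
            in1 = incoming[1]; out1 = outgoing[1];   // setup block, column 1
        }

        void copy(int inIndex, int outIndex) {
            out0[outIndex] = in0[inIndex];           // copy block, column 0
            out1[outIndex] = in1[inIndex];           // copy block, column 1
        }
    }

    // Shape of a generic copier: the code size is fixed and the column
    // count is handled by a loop at run time.
    static final class GenericCopier {
        private int[][] in, out;

        void setup(int[][] incoming, int[][] outgoing) {
            in = incoming;
            out = outgoing;
        }

        void copy(int inIndex, int outIndex) {
            for (int col = 0; col < in.length; col++) {
                out[col][outIndex] = in[col][inIndex];
            }
        }
    }
}
```

With 1000 columns the first shape would need 2000 variables and 2000 blocks of code, while the second stays the same size.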
> This works fine for rows with few columns. But for rows with many columns
> (such as for queries against JSON documents), the amount of code produced
> becomes very large. This introduces extra overhead to generate, compile and
> store the extra code.
> DRILL-5125 found the same issue in the Selection Vector remover. By applying
> the fix from that ticket to the external sort, we reap a 24% savings in the
> copier-only test described here.
> Consider a unit test that runs only the copier. Create a single record batch
> with 1000 columns and 64K rows. Use the copier to produce a set of smaller
> output batches. Such a test factors out all the overhead of running a query.
> * Run time with the generated copier: 17 secs.
> * Run time with the generic copier: 13 secs.
> * Savings: 4 secs, or 24%.
> (13 seconds is still a very long time to process 64K rows. There may be
> optimizations to be had in the priority queue implementation as well, but
> that is a separate issue.)
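
A rough sketch of the shape of such a copier-only test follows, along with the savings arithmetic. Everything here is an assumption for illustration: the class and method names are hypothetical, int arrays stand in for a record batch, and the plain nested loop stands in for whichever copier is under test. It does not reproduce the actual Drill unit test or its timings.

```java
// Hypothetical harness; an int[][] stands in for a record batch
// (one array per column). Not Drill code.
final class CopierHarnessSketch {

    // Copies all rows of `input` into output batches of at most `batchSize`
    // rows each, and returns the number of output batches produced.
    static int copyInBatches(int[][] input, int rowCount, int batchSize) {
        int batches = 0;
        for (int start = 0; start < rowCount; start += batchSize) {
            int rows = Math.min(batchSize, rowCount - start);
            int[][] output = new int[input.length][rows];
            for (int row = 0; row < rows; row++) {
                for (int col = 0; col < input.length; col++) {
                    output[col][row] = input[col][start + row];
                }
            }
            batches++;
        }
        return batches;
    }

    // Wall-clock time for one run, in milliseconds. A real measurement
    // would also need JIT warm-up and repeated runs.
    static long timeMillis(int columns, int rowCount, int batchSize) {
        int[][] input = new int[columns][rowCount];
        long begin = System.nanoTime();
        copyInBatches(input, rowCount, batchSize);
        return (System.nanoTime() - begin) / 1_000_000;
    }

    // Savings as a whole percentage of the original run time.
    static long savingsPercent(double beforeSecs, double afterSecs) {
        return Math.round(100.0 * (beforeSecs - afterSecs) / beforeSecs);
    }
}
```

For the numbers reported above, savingsPercent(17, 13) is (17 - 13) / 17, about 23.5%, which rounds to the 24% quoted.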
> To be conservative, provide a config option to enable the feature, perhaps by
> setting a threshold on the number of columns that must be present to use the
> generic version. That way, if folks feel that the generated version is faster
> for narrow rows, the generated version can still be used, and each user can
> decide the point at which the cost of bulky code outweighs its performance
> benefit.
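
One way the proposed threshold might look, as a sketch only. The ticket does not name the option or define the exact comparison, so the option key, the enum, and the "at or above the threshold" rule below are all assumptions.

```java
// Hypothetical selection logic; the option name is made up for
// illustration and is not an actual Drill config key.
final class CopierSelectionSketch {

    static final String THRESHOLD_OPTION =
        "hypothetical.external_sort.generic_copier_column_threshold";

    enum CopierKind { GENERATED, GENERIC }

    // Use the generic copier once the row is at least `threshold` columns
    // wide; narrower rows keep the generated copier.
    static CopierKind choose(int columnCount, int threshold) {
        return columnCount >= threshold ? CopierKind.GENERIC : CopierKind.GENERATED;
    }
}
```

With a threshold of, say, 100 columns, a 5-column row would keep the generated copier and a 1000-column row would switch to the generic one.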
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)