[ https://issues.apache.org/jira/browse/DRILL-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kunal Khatua updated DRILL-6211:
--------------------------------
    Labels: performance  (was: )

> Optimizations for SelectionVectorRemover
> -----------------------------------------
>
>                 Key: DRILL-6211
>                 URL: https://issues.apache.org/jira/browse/DRILL-6211
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Codegen
>            Reporter: Kunal Khatua
>            Assignee: Karthikeyan Manivannan
>            Priority: Major
>              Labels: performance
>             Fix For: 1.15.0
>
>         Attachments: 255d264c-f55e-b343-0bef-49d3e672d93f.sys.drill,
>                      255d2664-2418-19e0-00ea-2076a06572a2.sys.drill,
>                      255d2682-8481-bed0-fc22-197a75371c04.sys.drill,
>                      255d26ae-2c0b-6cd6-ae71-4ad04c992daf.sys.drill,
>                      255d2880-48a2-d86b-5410-29ce0cd249ed.sys.drill
>
>
> Currently, when a SelectionVectorRemover receives a record batch from an
> upstream operator (like a Filter), it immediately starts copying records
> into a new outgoing batch.
> It would be worthwhile to enrich the RecordBatch with some additional
> summary statistics about the attached SelectionVector, such as
> # number of records that need to be removed/copied
> # total number of records in the record batch
> The benefit is that in the extreme cases, where *all* the records in a
> batch need to be either removed or copied, the SelectionVectorRemover can
> drop the record batch or forward it unchanged to the next downstream
> operator, respectively.
> The extreme case of dropping the whole batch already works well (because
> nothing is copied). For batches that should pass through unchanged,
> however, the copying overhead remains: it is more than 35% of the query
> time, if you discount the streaming agg cost within the tests.
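
The proposed fast path can be sketched as follows. This is only an illustration under stated assumptions, not Drill code: the class SvrFastPath, the enum Action, and the methods decide and remove are hypothetical names, a plain int[] stands in for a value vector, and the selection indices are assumed to be unique and in ascending order, as a filter would produce them.

```java
/** Hypothetical sketch of the proposed fast path; none of these names are Drill APIs. */
final class SvrFastPath {

    enum Action { DROP_BATCH, PASS_THROUGH, COPY_SELECTED }

    /**
     * Chooses what to do with an incoming batch from the two proposed statistics.
     *
     * @param selectedCount number of records the selection vector keeps
     * @param totalCount    total number of records in the incoming batch
     */
    static Action decide(int selectedCount, int totalCount) {
        if (selectedCount < 0 || selectedCount > totalCount) {
            throw new IllegalArgumentException("selectedCount out of range");
        }
        if (selectedCount == 0) {
            return Action.DROP_BATCH;      // nothing survives: no copy needed
        }
        if (selectedCount == totalCount) {
            return Action.PASS_THROUGH;    // everything survives: forward as-is
        }
        return Action.COPY_SELECTED;       // partial selection: copy as today
    }

    /**
     * Stand-in for the remover on a single int column. Assumes the selection
     * holds unique, ascending indices, so a full selection is the identity.
     */
    static int[] remove(int[] column, int[] selection) {
        switch (decide(selection.length, column.length)) {
            case DROP_BATCH:
                return new int[0];
            case PASS_THROUGH:
                return column;             // same array, no new batch allocated
            default:
                int[] out = new int[selection.length];
                for (int i = 0; i < selection.length; i++) {
                    out[i] = column[selection[i]];
                }
                return out;
        }
    }
}
```

With a four-record column, remove(column, new int[] {0, 1, 2, 3}) returns the incoming array itself, which is the 100% selectivity case where the copy is avoided.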
> Here are the statistics that motivate such an optimization:
> ||Selectivity||Query Time||%Time used by SVR||SVR Time||Profile||
> |0%|6.996|0.13%|0.0090948|[^255d264c-f55e-b343-0bef-49d3e672d93f.sys.drill]|
> |10%|7.836|7.97%|0.6245292|[^255d2682-8481-bed0-fc22-197a75371c04.sys.drill]|
> |50%|11.225|25.59%|2.8724775|[^255d2664-2418-19e0-00ea-2076a06572a2.sys.drill]|
> |90%|14.966|33.91%|5.0749706|[^255d26ae-2c0b-6cd6-ae71-4ad04c992daf.sys.drill]|
> |100%|19.003|35.73%|6.7897719|[^255d2880-48a2-d86b-5410-29ce0cd249ed.sys.drill]|
> To summarize, the SVR should avoid creating new batches as much as possible.
> A more generic (non-trivial) optimization should take into account the fact
> that multiple emitted batches can be coalesced, but we don't currently have
> test metrics for that.
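
For reference, the fourth column of the table matches the query time multiplied by the SVR's percentage share in every row, so it reads as the time attributed to the SVR. A minimal sketch of that arithmetic follows; the class SvrTimeShare and the method svrTime are hypothetical names, and the time unit is left as whatever the table uses, since the issue does not state it.

```java
/** Hypothetical helper that reproduces the fourth column of the table. */
final class SvrTimeShare {

    /**
     * Time attributed to the SVR: total query time scaled by the SVR's share.
     *
     * @param queryTime  total query time, in the table's (unstated) unit
     * @param svrPercent share of time used by the SVR, in percent
     */
    static double svrTime(double queryTime, double svrPercent) {
        return queryTime * svrPercent / 100.0;
    }
}
```

For the 100% selectivity row, svrTime(19.003, 35.73) gives 6.7897719 up to floating-point rounding, which agrees with the table.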