Yes, this would be a good enhancement. Any improvement to the efficiency/compactness of the generated code is complimentary to other optimizations such as parquet filter pushdown. I recall that there was a JIRA a while ago with hundreds or thousands of filter conditions creating a really bloated generated code - we should revisit that at some point to identify scope for improvement. I am not so sure about the UDF suggestion in #2. It seems like identifying why the large IN-list join approach was slow and fixing that would be a general solution.
Aman On Wed, Sep 9, 2015 at 1:31 PM, Jinfeng Ni <[email protected]> wrote: > Weeks ago there was a message on drill user list, reporting performance > issues caused by in list filter [1]. The query has filter: > > WHERE > c0 IN (v_00, v_01, v_02, v_03, ... ) > OR > c1 IN (v_11, v_11, v_12, v_13, ....) > OR > c2 IN ... > OR > c3 IN ... > OR > .... > > The profile shows that most of query time is spent on filter evaluation. > One workaround that we recommend was to re-write the query so that the > planner would convert in list into join operation. Turns out that > converting > into join did help improve performance, but not as much as we wanted. > > The original query has parquet as the data source. Therefore, the ideal > solution is parquet filter pushdown, which DRILL-1950 would address. > > On the other hand, I noticed that there seems to be room for improvement > in the run-time generated code. In particular, for " c0 in (v_00, v_01, > ...)", > Drill will evaluate it as : > c0 = v_00 OR c0 = v_01 OR ... > > Each reference of "c0" will lead to initialization of vector and holder > assignment in the generated code. There is redundant evaluation for > the common reference. > > I put together a patch,which will avoid the redundant evaluation for the > common reference. Using TPCH scale factor 10's lineitem table, I saw > quite surprising improvement. (run on Mac with embedded drillbit) > > 1) In List uses integer type [2] > master branch : 12.53 seconds > patch on top of master branch : 7.073 seconds > That's almost 45% improvement. > > 2) In List uses binary type [3] > master branch : 198.668 seconds > patch on top of master branch: 20.37 seconds > > Two thoughts: > 1. Will code size impact Janino compiler optimization or jvm hotspot > optimization? Otherwise, it seems hard to explain the performance > difference of removing the redundant evaluation. That might imply > that the efficiency of run-time generated code may degrade with > more expressions in the query (?) > > 2. For In-List filter, it might make sense to create a Drill UDF. The > UDF will build a heap-based hashtable in setup, in a similar way > as what the join approach will do. > > I'm going to open a JIRA to submit the patch for review, as I feel > it will benefit not only the in list filter, but also expressions with > common column references. > > > [1] > > https://mail-archives.apache.org/mod_mbox/drill-user/201508.mbox/%3CCAC-7oTym0Yzr2RmXhDPag6k41se-uTkWu0QC%3DMABb7s94DJ0BA%40mail.gmail.com%3E > > [2] https://gist.github.com/jinfengni/7f6df9ed7d2c761fed33 > > [3] https://gist.github.com/jinfengni/7460f6d250f0d00009ed >
