Yes, this would be a good enhancement.  Any improvement to the
efficiency/compactness of the generated code is complementary to other
optimizations such as parquet filter pushdown.  I recall that there was a
JIRA a while ago in which hundreds or thousands of filter conditions
produced really bloated generated code; we should revisit that at some
point to identify scope for improvement.
I am not so sure about the UDF suggestion in #2.  It seems like
identifying why the large IN-list join approach was slow and fixing that
would be a more general solution.

Aman

On Wed, Sep 9, 2015 at 1:31 PM, Jinfeng Ni <[email protected]> wrote:

> A few weeks ago there was a message on the Drill user list reporting
> performance issues caused by an IN-list filter [1].  The query has this
> filter:
>
> WHERE
>    c0 IN (v_00, v_01, v_02, v_03, ... )
> OR
>    c1 IN (v_10, v_11, v_12, v_13, ....)
> OR
>    c2 IN ...
> OR
>    c3 IN ...
> OR
>    ....
>
> The profile shows that most of the query time is spent on filter
> evaluation. One workaround that we recommended was to rewrite the query
> so that the planner would convert the IN list into a join operation. It
> turned out that converting to a join did help improve performance, but
> not as much as we wanted.
>
> The original query has parquet as the data source. Therefore, the ideal
> solution is parquet filter pushdown, which DRILL-1950 would address.
>
> On the other hand, I noticed that there seems to be room for improvement
> in the run-time generated code. In particular, for
> "c0 IN (v_00, v_01, ...)", Drill will evaluate it as:
>     c0 = v_00  OR c0 = v_01 OR ...
>
> Each reference to "c0" leads to vector initialization and holder
> assignment in the generated code, so the common reference is evaluated
> redundantly.
>
> I put together a patch that avoids the redundant evaluation for the
> common reference.  Using the lineitem table from TPC-H at scale factor 10,
> I saw a quite surprising improvement (run on a Mac with an embedded
> drillbit).
>
> 1) In List uses integer type [2]
>   master branch :  12.53 seconds
>   patch on top of master branch : 7.073 seconds
> That's about a 44% reduction in run time.
>
> 2) In List uses binary type [3]
>   master branch :  198.668 seconds
>   patch on top of master branch : 20.37 seconds
>
> Two thoughts:
> 1. Does code size impact Janino compiler optimization or JVM HotSpot
> optimization? Otherwise, it seems hard to explain why removing the
> redundant evaluation makes such a performance difference. That might
> imply that the efficiency of run-time generated code degrades as the
> query contains more expressions (?)
>
> 2. For the IN-list filter, it might make sense to create a Drill UDF. The
> UDF would build a heap-based hash table in setup, similar to what the
> join approach does.
>
> I'm going to open a JIRA to submit the patch for review, as I feel
> it will benefit not only the IN-list filter, but also other expressions
> with common column references.
>
>
> [1]
>
> https://mail-archives.apache.org/mod_mbox/drill-user/201508.mbox/%3CCAC-7oTym0Yzr2RmXhDPag6k41se-uTkWu0QC%3DMABb7s94DJ0BA%40mail.gmail.com%3E
>
> [2] https://gist.github.com/jinfengni/7f6df9ed7d2c761fed33
>
> [3]  https://gist.github.com/jinfengni/7460f6d250f0d00009ed
>
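
To illustrate the redundant evaluation described in the quoted message,
here is a minimal Java sketch. It is not Drill's actual generated code:
IntColumn, naive, and hoisted are hypothetical names, and the read counter
only stands in for the per-reference vector initialization and holder
assignment mentioned above.

```java
final class CommonRefSketch {
    // Hypothetical stand-in for a value vector; it counts reads so the
    // redundant work per column reference becomes visible.
    static final class IntColumn {
        private final int[] values;
        int reads = 0;

        IntColumn(int... values) {
            this.values = values;
        }

        int get(int row) {
            reads++;
            return values[row];
        }
    }

    // Mirrors "c0 = v_00 OR c0 = v_01 OR ...": every comparison
    // re-reads the column value for the same row.
    static boolean naive(IntColumn c0, int row, int[] inList) {
        for (int v : inList) {
            if (c0.get(row) == v) {
                return true;
            }
        }
        return false;
    }

    // Reads the common reference once and reuses it in every comparison.
    static boolean hoisted(IntColumn c0, int row, int[] inList) {
        int value = c0.get(row);
        for (int v : inList) {
            if (value == v) {
                return true;
            }
        }
        return false;
    }
}
```

Both methods return the same result; only the number of column reads
differs, and the gap grows with the length of the IN list.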

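The UDF idea in thought 2 of the quoted message could look roughly like
the following plain-Java sketch. It does not use Drill's UDF interfaces;
InListFilterSketch, setup, and eval are hypothetical names, and
java.util.HashSet stands in for the heap-based hash table. Binary values
are wrapped in ByteBuffer because byte[] does not compare by content.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

final class InListFilterSketch {
    // On-heap hash set holding the IN-list literals.
    private final Set<ByteBuffer> lookup = new HashSet<>();

    // Assumed to run once, before any rows are evaluated.
    void setup(String... literals) {
        for (String literal : literals) {
            byte[] bytes = literal.getBytes(StandardCharsets.UTF_8);
            lookup.add(ByteBuffer.wrap(bytes));
        }
    }

    // Assumed to run once per row: a single hash probe replaces the
    // chain of "c0 = v_00 OR c0 = v_01 OR ..." comparisons.
    boolean eval(byte[] columnValue) {
        return lookup.contains(ByteBuffer.wrap(columnValue));
    }
}
```

With this shape the per-row cost no longer depends on how many literals
the IN list contains, which is the same property the join rewrite relies
on.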