In list filter evaluation : room for improvement in run-time code generation.

Jinfeng Ni Wed, 09 Sep 2015 13:32:13 -0700

A few weeks ago there was a message on the Drill user list reporting
performance issues caused by an IN-list filter [1].  The query has a
filter of the form:

WHERE
c0 IN (v_00, v_01, v_02, v_03, ... )
OR
c1 IN (v_10, v_11, v_12, v_13, ... )
OR
c2 IN ...
OR
c3 IN ...
OR
....

The profile shows that most of the query time is spent on filter
evaluation. One workaround we recommended was to rewrite the query so
that the planner would convert the IN list into a join operation. It
turns out that converting to a join did improve performance, but not as
much as we wanted.

The original query uses Parquet as the data source, so the ideal
solution is Parquet filter pushdown, which DRILL-1950 would address.

On the other hand, I noticed that there seems to be room for improvement
in the run-time generated code. In particular, for
"c0 IN (v_00, v_01, ...)", Drill will evaluate it as:
c0 = v_00 OR c0 = v_01 OR ...

Each reference to "c0" leads to a vector initialization and a holder
assignment in the generated code, so the common reference is evaluated
redundantly, once per comparison.
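
To make the redundancy concrete, here is a minimal Java sketch. It is
not Drill's actual generated code: IntColumn, evalNaive and evalShared
are hypothetical names, and a loop stands in for the unrolled chain of
OR terms. It only contrasts re-reading the column for every comparison
with reading it once and reusing the value.

```java
final class CommonRefSketch {
    // Hypothetical stand-in for a column vector; counts how often a value is read.
    static final class IntColumn {
        final int[] values;
        int reads = 0;

        IntColumn(int... values) {
            this.values = values;
        }

        int get(int row) {
            reads++;
            return values[row];
        }
    }

    // Mirrors "c0 = v_00 OR c0 = v_01 OR ...": every term reads the column again.
    static boolean evalNaive(IntColumn c0, int row, int[] inList) {
        for (int v : inList) {
            if (c0.get(row) == v) {
                return true;
            }
        }
        return false;
    }

    // Reads the common reference once and reuses the value for every comparison.
    static boolean evalShared(IntColumn c0, int row, int[] inList) {
        int value = c0.get(row);
        for (int v : inList) {
            if (value == v) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] inList = {1, 2, 7};
        IntColumn naive = new IntColumn(5, 7);
        IntColumn shared = new IntColumn(5, 7);
        evalNaive(naive, 1, inList);
        evalShared(shared, 1, inList);
        System.out.println("naive reads = " + naive.reads
                + ", shared reads = " + shared.reads);
    }
}
```

In this toy version the naive form reads the column once per list
element it tests, while the shared form reads it once per row
regardless of the list length.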

I put together a patch that avoids the redundant evaluation of the
common reference. Using the lineitem table from TPC-H at scale factor
10, I saw a quite surprising improvement (run on a Mac with an embedded
drillbit).

1) In List uses integer type [2]
master branch : 12.53 seconds
patch on top of master branch : 7.073 seconds
That's roughly a 44% reduction in query time.

2) In List uses binary type [3]
master branch : 198.668 seconds
patch on top of master branch: 20.37 seconds

Two thoughts:
1. Does code size affect Janino compiler optimization or JVM HotSpot
optimization? Otherwise, it seems hard to explain the size of the
performance difference that comes from removing the redundant
evaluation. That might imply that the efficiency of run-time
generated code degrades as the query contains more expressions (?)

2. For the IN-list filter, it might make sense to create a Drill UDF.
The UDF would build a heap-based hash table in setup, similar to
what the join approach does.
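
A rough sketch of that idea in plain Java follows. It does not use
Drill's real UDF interfaces; the class InListFilterSketch and its
setup/eval methods are hypothetical and only model the split between
one-time setup and per-row evaluation, with java.util.HashSet standing
in for the heap-based hash table.

```java
import java.util.HashSet;
import java.util.Set;

final class InListFilterSketch {
    private final long[] literals; // the IN-list values (assumed to be integers here)
    private Set<Long> lookup;      // on-heap hash table, built once in setup()

    InListFilterSketch(long... literals) {
        this.literals = literals;
    }

    // Called once before any rows are evaluated: load the literals into a hash set.
    void setup() {
        lookup = new HashSet<>(literals.length * 2);
        for (long v : literals) {
            lookup.add(v);
        }
    }

    // Called per row: a single hash probe replaces the chain of equality tests.
    boolean eval(long columnValue) {
        return lookup.contains(columnValue);
    }

    public static void main(String[] args) {
        InListFilterSketch filter = new InListFilterSketch(3L, 5L, 8L);
        filter.setup();
        System.out.println(filter.eval(5L));
        System.out.println(filter.eval(4L));
    }
}
```

With this shape the per-row cost no longer grows with the length of
the IN list, at the price of boxing each probed value in this simple
version.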

I'm going to open a JIRA to submit the patch for review, as I feel
it will benefit not only the IN-list filter but also other expressions
with common column references.

[1]
https://mail-archives.apache.org/mod_mbox/drill-user/201508.mbox/%3CCAC-7oTym0Yzr2RmXhDPag6k41se-uTkWu0QC%3DMABb7s94DJ0BA%40mail.gmail.com%3E

[2] https://gist.github.com/jinfengni/7f6df9ed7d2c761fed33

[3] https://gist.github.com/jinfengni/7460f6d250f0d00009ed
