[
https://issues.apache.org/jira/browse/DRILL-6727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605302#comment-16605302
]
weijie.tong commented on DRILL-6727:
------------------------------------
At current implementation , the foreman will wait for all the bloom filters
of one HashJoin operator to come in. Different HashJoin operator's bloom
filters are treat separately.
"Is it possible for the individual RUNTIME_FILTER operators to construct this
on the fly as they receive it? And apply it the moment they get it?" This is
really a great idea. I will fire another Jira to tune it. As we implement this
method, the probe side will receive and apply the bloom filter as soon as
possible. But there's still some reasonable cases that the build side's total
speed is lower or equal to the probe side one which are not appliable to JPPD
feature . Those cases should be detected and removed at the planning stage.
That is the DRILL-6573's responsibility.
To DRILL-6718 , I tried to download the tpch 100 scale parquet data from the
[address|http://drill-public.s3.amazonaws.com/tpch/sf100/parquet/tpch_sf100_parquet.tgz]
according to the test framework's guideline. But it always corrupted at the
downloading phase. Any other link could be supplied ?
> JPPD does not eliminate rows using the bloom filter if a HashJoin is involved
> -----------------------------------------------------------------------------
>
> Key: DRILL-6727
> URL: https://issues.apache.org/jira/browse/DRILL-6727
> Project: Apache Drill
> Issue Type: Improvement
> Components: Execution - Flow
> Affects Versions: 1.15.0
> Reporter: Kunal Khatua
> Assignee: weijie.tong
> Priority: Critical
> Attachments:
> bcastJoin-JPPD_2477fb99-36cb-9bc2-b7fb-c81a52b256d2.json,
> bcastJoin-default_2477fa68-a31e-3b97-5469-373845c2b763.json,
> hashJoin-JPPD_2477f6f7-14e0-ca23-d9f7-6b0273c20964.json,
> hashJoin-default_2477f5e8-fff2-fc83-d251-d8be8f92820b.json
>
>
> When testing a simple join between 2 tables, it appears that the Bloom-filter
> based predicate pushdown will work only for broadcast joins, but not for
> hash-based joins.
> Since the purpose of the filter is to reduce the number of records being
> hashed across the fragments, the runtime does not improve.
> Join Query (TPCH dataset):
> {code:sql}
> select
> l.l_orderkey
> , sum(l.l_extendedprice * (1 - l.l_discount)) as revenue
> , o.o_orderdate
> , o.o_shippriority
> from
> orders o
> , lineitem l
> where
> l.l_orderkey = o.o_orderkey
> and o.o_orderdate = date '1994-08-26'
> and MOD(o.o_custkey,10) = 1
> group by
> l.l_orderkey
> , o.o_orderdate
> , o.o_shippriority
> order by
> revenue desc
> , o.o_orderdate limit 10;
> {code}
> This generates an output of about 6K rows from the build side, with the
> expectation of 10M rows being joined from the probe side.
> Following are the results of the following query:
> || Join Mode || Profile || Runtime || Status ||
> |BCastJoin w/o JPPD |
> [^bcastJoin-default_2477fa68-a31e-3b97-5469-373845c2b763.json] | 3.148sec |
> As expected. 600M rows are scanned and probed against the locally available
> hash table. |
> |BCastJoin w/ JPPD |
> [^bcastJoin-JPPD_2477fb99-36cb-9bc2-b7fb-c81a52b256d2.json] | 3.570sec |
> 04-xx-06 shows a reduction in rows. 600M rows are scanned, but only 10M rows
> are probed against the locally available hash table. |
> |
> |HashJoin w/o JPPD |
> [^hashJoin-default_2477f5e8-fff2-fc83-d251-d8be8f92820b.json] | 5.861sec |
> As expected. 600M rows are scanned and probed against the hash table. |
> |HashJoin w/ JPPD |
> [^hashJoin-JPPD_2477f6f7-14e0-ca23-d9f7-6b0273c20964.json] | 8.376sec |
> 03-xx-07 is not seeing a reduction in rows. All 600M rows are scanned and
> probed against the hash table. |
> There are a few possibilities of why the RuntimeFilter does not eliminate any
> rows when a HashJoin is involved.
> 1. The RuntimeFilter operator does not have a bloom-filter
> 2. The RuntimeFilter receives the bloom-filter after the scan completes,
> because the foreman has not finished building and distributing the global
> bloom-filter
> 3. The RuntimeFilter receives the bloom-filter during the scan, but does not
> apply it.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)