Re: Bloom filter Pushdown Optimization for Merge Join

Zhihong Yu Sun, 02 Oct 2022 07:43:37 -0700

On Sun, Oct 2, 2022 at 6:40 AM Zhihong Yu <z...@yugabyte.com> wrote:

>
>
> On Sat, Oct 1, 2022 at 12:45 AM Zhihong Yu <z...@yugabyte.com> wrote:
>
>>
>>
>> On Fri, Sep 30, 2022 at 9:20 PM Zhihong Yu <z...@yugabyte.com> wrote:
>>
>>>
>>>
>>> On Fri, Sep 30, 2022 at 8:40 PM Zhihong Yu <z...@yugabyte.com> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Sep 30, 2022 at 3:44 PM Zheng Li <zhengl...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> A bloom filter provides early filtering of rows that cannot be joined
>>>>> before they would reach the join operator, the optimization is also
>>>>> called a semi join filter (SJF) pushdown. Such a filter can be created
>>>>> when one child of the join operator must materialize its derived table
>>>>> before the other child is evaluated.
>>>>>
>>>>> For example, a bloom filter can be created using the the join keys for
>>>>> the build side/inner side of a hash join or the outer side of a merge
>>>>> join, the bloom filter can then be used to pre-filter rows on the
>>>>> other side of the join operator during the scan of the base relation.
>>>>> The thread about “Hash Joins vs. Bloom Filters / take 2” [1] is good
>>>>> discussion on using such optimization for hash join without going into
>>>>> the pushdown of the filter where its performance gain could be further
>>>>> increased.
>>>>>
>>>>> We worked on prototyping bloom filter pushdown for both hash join and
>>>>> merge join. Attached is a patch set for bloom filter pushdown for
>>>>> merge join. We also plan to send the patch for hash join once we have
>>>>> it rebased.
>>>>>
>>>>> Here is a summary of the patch set:
>>>>> 1. Bloom Filter Pushdown optimizes Merge Join by filtering rows early
>>>>> during the table scan instead of later on.
>>>>>         -The bloom filter is pushed down along the execution tree to
>>>>> the target SeqScan nodes.
>>>>>         -Experiments show that this optimization can speed up Merge
>>>>> Join by up to 36%.
>>>>>
>>>>> 2. The planner makes the decision to use the bloom filter based on the
>>>>> estimated filtering rate and the expected performance gain.
>>>>>         -The planner accomplishes this by estimating four numbers per
>>>>> variable - the total number of rows of the relation, the number of
>>>>> distinct values for a given variable, and the minimum and maximum
>>>>> value of the variable (when applicable). Using these numbers, the
>>>>> planner estimates a filtering rate of a potential filter.
>>>>>         -Because actually creating and implementing the filter adds
>>>>> more operations, there is a minimum threshold of filtering where the
>>>>> filter would actually be useful. Based on testing, we query to see if
>>>>> the estimated filtering rate is higher than 35%, and that informs our
>>>>> decision to use a filter or not.
>>>>>
>>>>> 3. If using a bloom filter, the planner also adjusts the expected cost
>>>>> of Merge Join based on expected performance gain.
>>>>>
>>>>> 4. Capability to build the bloom filter in parallel in case of
>>>>> parallel SeqScan. This is done efficiently by populating a local bloom
>>>>> filter for each parallel worker and then taking a bitwise OR over all
>>>>> the local bloom filters to form a shared bloom filter at the end of
>>>>> the parallel SeqScan.
>>>>>
>>>>> 5. The optimization is GUC controlled, with settings of
>>>>> enable_mergejoin_semijoin_filter and force_mergejoin_semijoin_filter.
>>>>>
>>>>> We found in experiments that there is a significant improvement
>>>>> when using the bloom filter during Merge Join. One experiment involved
>>>>> joining two large tables while varying the theoretical filtering rate
>>>>> (TFR) between the two tables, the TFR is defined as the percentage
>>>>> that the two datasets are disjoint. Both tables in the merge join were
>>>>> the same size. We tested changing the TFR to see the change in
>>>>> filtering optimization.
>>>>>
>>>>> For example, let’s imagine t0 has 10 million rows, which contain the
>>>>> numbers 1 through 10 million randomly shuffled. Also, t1 has the
>>>>> numbers 4 million through 14 million randomly shuffled. Then the TFR
>>>>> for a join of these two tables is 40%, since 40% of the tables are
>>>>> disjoint from the other table (1 through 4 million for t0, 10 million
>>>>> through 14 million for t4).
>>>>>
>>>>> Here is the performance test result joining two tables:
>>>>> TFR: theoretical filtering rate
>>>>> EFR: estimated filtering rate
>>>>> AFR: actual filtering rate
>>>>> HJ: hash join
>>>>> MJ Default: default merge join
>>>>> MJ Filter: merge join with bloom filter optimization enabled
>>>>> MJ Filter Forced: merge join with bloom filter optimization forced
>>>>>
>>>>> TFR   EFR   AFR   HJ   MJ Default   MJ Filter   MJ Filter Forced
>>>>>
>>>>> -------------------------------------------------------------------------------------
>>>>> 10     33.46   7.41    6529   22638        21949            23160
>>>>> 20     37.27  14.85   6483   22290        21928            21930
>>>>> 30     41.32   22.25  6395   22374        20718            20794
>>>>> 40     45.67   29.7    6272   21969        19449            19410
>>>>> 50     50.41   37.1    6210   21412        18222            18224
>>>>> 60     55.64   44.51  6052   21108        17060            17018
>>>>> 70     61.59   51.98  5947   21020        15682            15737
>>>>> 80     68.64   59.36  5761   20812        14411            14437
>>>>> 90     77.83   66.86  5701   20585        13171            13200
>>>>> Table. Execution Time (ms) vs Filtering Rate (%) for Joining Two
>>>>> Tables of 10M Rows.
>>>>>
>>>>> Attached you can find figures of the same performance test and a SQL
>>>>> script
>>>>> to reproduce the performance test.
>>>>>
>>>>> The first thing to notice is that Hash Join generally is the most
>>>>> efficient join strategy. This is because Hash Join is better at
>>>>> dealing with small tables, and our size of 10 million is still small
>>>>> enough where Hash Join outperforms the other join strategies. Future
>>>>> experiments can investigate using much larger tables.
>>>>>
>>>>> However, comparing just within the different Merge Join variants, we
>>>>> see that using the bloom filter greatly improves performance.
>>>>> Intuitively, all of these execution times follow linear paths.
>>>>> Comparing forced filtering versus default, we can see that the default
>>>>> Merge Join outperforms Merge Join with filtering at low filter rates,
>>>>> but after about 20% TFR, the Merge Join with filtering outperforms
>>>>> default Merge Join. This makes intuitive sense, as there are some
>>>>> fixed costs associated with building and checking with the bloom
>>>>> filter. In the worst case, at only 10% TFR, the bloom filter makes
>>>>> Merge Join less than 5% slower. However, in the best case, at 90% TFR,
>>>>> the bloom filter improves Merge Join by 36%.
>>>>>
>>>>> Based on the results of the above experiments, we came up with a
>>>>> linear equation for the performance ratio for using the filter
>>>>> pushdown from the actual filtering rate. Based on the numbers
>>>>> presented in the figure, this is the equation:
>>>>>
>>>>> T_filter / T_no_filter = 1 / (0.83 * estimated filtering rate + 0.863)
>>>>>
>>>>> For example, this means that with an estimated filtering rate of 0.4,
>>>>> the execution time of merge join is estimated to be improved by 16.3%.
>>>>> Note that the estimated filtering rate is used in the equation, not
>>>>> the theoretical filtering rate or the actual filtering rate because it
>>>>> is what we have during planning. In practice the estimated filtering
>>>>> rate isn’t usually accurate. In fact, the estimated filtering rate can
>>>>> differ from the theoretical filtering rate by as much as 17% in our
>>>>> experiments. One way to mitigate the power loss of bloom filter caused
>>>>> by inaccurate estimated filtering rate is to adaptively turn it off at
>>>>> execution time, this is yet to be implemented.
>>>>>
>>>>> Here is a list of tasks we plan to work on in order to improve this
>>>>> patch:
>>>>> 1. More regression testing to guarantee correctness.
>>>>> 2. More performance testing involving larger tables and complicated
>>>>> query plans.
>>>>> 3. Improve the cost model.
>>>>> 4. Explore runtime tuning such as making the bloom filter checking
>>>>> adaptive.
>>>>> 5. Currently, only the best single join key is used for building the
>>>>> Bloom filter. However, if there are several keys and we know that
>>>>> their distributions are somewhat disjoint, we could leverage this fact
>>>>> and use multiple keys for the bloom filter.
>>>>> 6. Currently, Bloom filter pushdown is only implemented for SeqScan
>>>>> nodes. However, it would be possible to allow push down to other types
>>>>> of scan nodes.
>>>>> 7. Explore if the Bloom filter could be pushed down through a foreign
>>>>> scan when the foreign server is capable of handling it – which could
>>>>> be made true for postgres_fdw.
>>>>> 8. Better explain command on the usage of bloom filters.
>>>>>
>>>>> This patch set is prepared by Marcus Ma, Lyu Pan and myself. Feedback
>>>>> is appreciated.
>>>>>
>>>>> With Regards,
>>>>> Zheng Li
>>>>> Amazon RDS/Aurora for PostgreSQL
>>>>>
>>>>> [1]
>>>>> https://www.postgresql.org/message-id/flat/c902844d-837f-5f63-ced3-9f7fd222f175%402ndquadrant.com
>>>>
>>>>
>>>> Hi,
>>>> In the header of patch 1:
>>>>
>>>> In this prototype, the cost model is based on an assumption that there
>>>> is a linear relationship between the performance gain from using a semijoin
>>>> filter and the estimated filtering rate:
>>>> % improvement to Merge Join cost = 0.83 * estimated filtering rate -
>>>> 0.137.
>>>>
>>>> How were the coefficients (0.83 and 0.137) determined ?
>>>> I guess they were based on the results of running certain workload.
>>>>
>>>> Cheers
>>>>
>>> Hi,
>>> For patch 1:
>>>
>>> +bool       enable_mergejoin_semijoin_filter;
>>> +bool       force_mergejoin_semijoin_filter;
>>>
>>> How would (enable_mergejoin_semijoin_filter = off,
>>> force_mergejoin_semijoin_filter = on) be interpreted ?
>>> Have you considered using one GUC which has three values: off, enabled,
>>> forced ?
>>>
>>> +       mergeclauses_for_sjf =
>>> get_actual_clauses(path->path_mergeclauses);
>>> +       mergeclauses_for_sjf =
>>> get_switched_clauses(path->path_mergeclauses,
>>> +
>>> path->jpath.outerjoinpath->parent->relids);
>>>
>>> mergeclauses_for_sjf is assigned twice and I don't
>>> see mergeclauses_for_sjf being reference in the call
>>> to get_switched_clauses().
>>> Is this intentional ?
>>>
>>> +           /* want at least 1000 rows_filtered to avoid any nasty edge
>>> cases */
>>> +           if (force_mergejoin_semijoin_filter || (filteringRate >=
>>> 0.35 && rows_filtered > 1000))
>>>
>>> The above condition is narrower compared to the enclosing condition.
>>> Since there is no else block for the second if block, please merge the
>>> two if statements.
>>>
>>> +   int         best_filter_clause;
>>>
>>> Normally I would think `clause` is represented by List*. But
>>> best_filter_clause is an int. Please use another variable name so that
>>> there is less chance of confusion.
>>>
>>> For evaluate_semijoin_filtering_rate():
>>>
>>> +   double      best_sj_selectivity = 1.01;
>>>
>>> How was 1.01 determined ?
>>>
>>> +   debug_sj1("SJPD:  start evaluate_semijoin_filtering_rate");
>>>
>>> There are debug statements in the methods.
>>> It would be better to remove them in the next patch set.
>>>
>>> Cheers
>>>
>> Hi,
>> Still patch 1.
>>
>> +       if (!outer_arg_md->is_or_maps_to_base_column
>> +           && !inner_arg_md->is_or_maps_to_constant)
>> +       {
>> +           debug_sj2("SJPD:        outer equijoin arg does not map %s",
>> +                     "to a base column nor a constant; semijoin is not
>> valid");
>>
>> Looks like there is a typo:  inner_arg_md->is_or_maps_to_constant should
>> be outer_arg_md->is_or_maps_to_constant
>>
>> +       if (outer_arg_md->est_col_width > MAX_SEMIJOIN_SINGLE_KEY_WIDTH)
>> +       {
>> +           debug_sj2("SJPD:        outer equijoin column's width %s",
>> +                     "was excessive; condition rejected");
>>
>> How is the value of MAX_SEMIJOIN_SINGLE_KEY_WIDTH determined ?
>>
>> For verify_valid_pushdown():
>>
>> +   Assert(path);
>> +   Assert(target_var_no > 0);
>> +
>> +   if (path == NULL)
>> +   {
>> +       return false;
>>
>> I don't understand the first assertion. Does it mean path would always be
>> non-NULL ? Then the if statement should be dropped.
>>
>> +               if (path->parent->relid == target_var_no)
>> +               {
>> +                   /*
>> +                    * Found source of target var! We know that the
>> pushdown
>> +                    * is valid now.
>> +                    */
>> +                   return true;
>> +               }
>> +               return false;
>>
>> The above can be simplified as: return path->parent->relid ==
>> target_var_no;
>>
>> + *     True if the given con_exprs, ref_exprs and operators will exactlty
>>
>> Typo: exactlty -> exactly
>>
>> +   if (!bms_equal(all_vars, matched_vars))
>> +       return false;
>> +   return true;
>>
>> The above can be simplified as: return bms_equal(all_vars, matched_vars);
>>
>> Cheers
>>
> Hi,
> Still in patch 1 :-)
>
> +   if (best_path->use_semijoinfilter)
> +   {
> +       if (best_path->best_mergeclause != -1)
>
> Since there is no else block, the two conditions can be combined.
>
> +           ListCell   *clause_cell = list_nth_cell(mergeclauses,
> best_path->best_mergeclause);
>
> As shown in the above code, best_mergeclause is the position of the best
> merge clause in mergeclauses.
> I think best_mergeclause_pos (or similar name) is more appropriate for the
> fieldname.
>
> For depth_of_semijoin_target():
>
> + *  Parameters:
> + *  node: plan node to be considered for semijoin push down.
>
> The name of the parameter is pn - please align the comment with code.
>
> For T_SubqueryScan case in depth_of_semijoin_target():
>
> +               Assert(rte->subquery->targetList);
> ...
> +               if (rel && rel->subroot
> +                   && rte && rte->subquery && rte->subquery->targetList)
>
> It seems the condition can be simplified since rte->subquery->targetList
> has passed the assertion.
>
> For is_table_scan_node_source_of_relids_or_var(), the else block can be
> simplified to returning scan_node_varno == target_var->varno directly.
>
> For get_appendrel_occluded_references():
>
> + *  Given a virtual column from an Union ALL subquery,
> + *  return the expression it immediately occludes that satisfy
>
> Since the index is returned from the func, it would be better to clarify
> the comment by saying `return the last index of expression ...`
>
> +   /* Subquery without append and partitioned tables */
>
> append and partitioned tables -> append or partitioned tables
>
> More reviews for subsequent patches to follow.
>
Hi,
For 0002-Support-semijoin-filter-in-the-executor-for-non-para.patch ,


+   if (!qual && !projInfo && !IsA(node, SeqScanState) &&
+       !((SeqScanState *) node)->applySemiJoinFilter)

I am confused by the last two clauses in the condition. If !IsA(node,
SeqScanState) is true, why does the last clause cast node to SeqScanState *
?
I think you forgot to put the last two clauses in a pair of parentheses.

+                       /* slot did not pass SemiJoinFilter, so skipping
it. */

skipping it -> skip it

+           /* double row estimate to reduce error rate for Bloom filter */
+           *nodeRows = Max(*nodeRows, scan->ss.ps.plan->plan_rows * 2);

Probably add more comment above about why the row count is doubled and how
the error rate is reduced.

+SemiJoinFilterExamineSlot(List *semiJoinFilters, TupleTableSlot *slot, Oid
tableId)

SemiJoinFilterExamineSlot -> ExamineSlotUsingSemiJoinFilter

Cheers

Re: Bloom filter Pushdown Optimization for Merge Join

Reply via email to