On Sun, Oct 2, 2022 at 6:40 AM Zhihong Yu <z...@yugabyte.com> wrote: > > > On Sat, Oct 1, 2022 at 12:45 AM Zhihong Yu <z...@yugabyte.com> wrote: > >> >> >> On Fri, Sep 30, 2022 at 9:20 PM Zhihong Yu <z...@yugabyte.com> wrote: >> >>> >>> >>> On Fri, Sep 30, 2022 at 8:40 PM Zhihong Yu <z...@yugabyte.com> wrote: >>> >>>> >>>> >>>> On Fri, Sep 30, 2022 at 3:44 PM Zheng Li <zhengl...@gmail.com> wrote: >>>> >>>>> Hello, >>>>> >>>>> A bloom filter provides early filtering of rows that cannot be joined >>>>> before they would reach the join operator, the optimization is also >>>>> called a semi join filter (SJF) pushdown. Such a filter can be created >>>>> when one child of the join operator must materialize its derived table >>>>> before the other child is evaluated. >>>>> >>>>> For example, a bloom filter can be created using the the join keys for >>>>> the build side/inner side of a hash join or the outer side of a merge >>>>> join, the bloom filter can then be used to pre-filter rows on the >>>>> other side of the join operator during the scan of the base relation. >>>>> The thread about “Hash Joins vs. Bloom Filters / take 2” [1] is good >>>>> discussion on using such optimization for hash join without going into >>>>> the pushdown of the filter where its performance gain could be further >>>>> increased. >>>>> >>>>> We worked on prototyping bloom filter pushdown for both hash join and >>>>> merge join. Attached is a patch set for bloom filter pushdown for >>>>> merge join. We also plan to send the patch for hash join once we have >>>>> it rebased. >>>>> >>>>> Here is a summary of the patch set: >>>>> 1. Bloom Filter Pushdown optimizes Merge Join by filtering rows early >>>>> during the table scan instead of later on. >>>>> -The bloom filter is pushed down along the execution tree to >>>>> the target SeqScan nodes. >>>>> -Experiments show that this optimization can speed up Merge >>>>> Join by up to 36%. >>>>> >>>>> 2. The planner makes the decision to use the bloom filter based on the >>>>> estimated filtering rate and the expected performance gain. >>>>> -The planner accomplishes this by estimating four numbers per >>>>> variable - the total number of rows of the relation, the number of >>>>> distinct values for a given variable, and the minimum and maximum >>>>> value of the variable (when applicable). Using these numbers, the >>>>> planner estimates a filtering rate of a potential filter. >>>>> -Because actually creating and implementing the filter adds >>>>> more operations, there is a minimum threshold of filtering where the >>>>> filter would actually be useful. Based on testing, we query to see if >>>>> the estimated filtering rate is higher than 35%, and that informs our >>>>> decision to use a filter or not. >>>>> >>>>> 3. If using a bloom filter, the planner also adjusts the expected cost >>>>> of Merge Join based on expected performance gain. >>>>> >>>>> 4. Capability to build the bloom filter in parallel in case of >>>>> parallel SeqScan. This is done efficiently by populating a local bloom >>>>> filter for each parallel worker and then taking a bitwise OR over all >>>>> the local bloom filters to form a shared bloom filter at the end of >>>>> the parallel SeqScan. >>>>> >>>>> 5. The optimization is GUC controlled, with settings of >>>>> enable_mergejoin_semijoin_filter and force_mergejoin_semijoin_filter. >>>>> >>>>> We found in experiments that there is a significant improvement >>>>> when using the bloom filter during Merge Join. One experiment involved >>>>> joining two large tables while varying the theoretical filtering rate >>>>> (TFR) between the two tables, the TFR is defined as the percentage >>>>> that the two datasets are disjoint. Both tables in the merge join were >>>>> the same size. We tested changing the TFR to see the change in >>>>> filtering optimization. >>>>> >>>>> For example, let’s imagine t0 has 10 million rows, which contain the >>>>> numbers 1 through 10 million randomly shuffled. Also, t1 has the >>>>> numbers 4 million through 14 million randomly shuffled. Then the TFR >>>>> for a join of these two tables is 40%, since 40% of the tables are >>>>> disjoint from the other table (1 through 4 million for t0, 10 million >>>>> through 14 million for t4). >>>>> >>>>> Here is the performance test result joining two tables: >>>>> TFR: theoretical filtering rate >>>>> EFR: estimated filtering rate >>>>> AFR: actual filtering rate >>>>> HJ: hash join >>>>> MJ Default: default merge join >>>>> MJ Filter: merge join with bloom filter optimization enabled >>>>> MJ Filter Forced: merge join with bloom filter optimization forced >>>>> >>>>> TFR EFR AFR HJ MJ Default MJ Filter MJ Filter Forced >>>>> >>>>> ------------------------------------------------------------------------------------- >>>>> 10 33.46 7.41 6529 22638 21949 23160 >>>>> 20 37.27 14.85 6483 22290 21928 21930 >>>>> 30 41.32 22.25 6395 22374 20718 20794 >>>>> 40 45.67 29.7 6272 21969 19449 19410 >>>>> 50 50.41 37.1 6210 21412 18222 18224 >>>>> 60 55.64 44.51 6052 21108 17060 17018 >>>>> 70 61.59 51.98 5947 21020 15682 15737 >>>>> 80 68.64 59.36 5761 20812 14411 14437 >>>>> 90 77.83 66.86 5701 20585 13171 13200 >>>>> Table. Execution Time (ms) vs Filtering Rate (%) for Joining Two >>>>> Tables of 10M Rows. >>>>> >>>>> Attached you can find figures of the same performance test and a SQL >>>>> script >>>>> to reproduce the performance test. >>>>> >>>>> The first thing to notice is that Hash Join generally is the most >>>>> efficient join strategy. This is because Hash Join is better at >>>>> dealing with small tables, and our size of 10 million is still small >>>>> enough where Hash Join outperforms the other join strategies. Future >>>>> experiments can investigate using much larger tables. >>>>> >>>>> However, comparing just within the different Merge Join variants, we >>>>> see that using the bloom filter greatly improves performance. >>>>> Intuitively, all of these execution times follow linear paths. >>>>> Comparing forced filtering versus default, we can see that the default >>>>> Merge Join outperforms Merge Join with filtering at low filter rates, >>>>> but after about 20% TFR, the Merge Join with filtering outperforms >>>>> default Merge Join. This makes intuitive sense, as there are some >>>>> fixed costs associated with building and checking with the bloom >>>>> filter. In the worst case, at only 10% TFR, the bloom filter makes >>>>> Merge Join less than 5% slower. However, in the best case, at 90% TFR, >>>>> the bloom filter improves Merge Join by 36%. >>>>> >>>>> Based on the results of the above experiments, we came up with a >>>>> linear equation for the performance ratio for using the filter >>>>> pushdown from the actual filtering rate. Based on the numbers >>>>> presented in the figure, this is the equation: >>>>> >>>>> T_filter / T_no_filter = 1 / (0.83 * estimated filtering rate + 0.863) >>>>> >>>>> For example, this means that with an estimated filtering rate of 0.4, >>>>> the execution time of merge join is estimated to be improved by 16.3%. >>>>> Note that the estimated filtering rate is used in the equation, not >>>>> the theoretical filtering rate or the actual filtering rate because it >>>>> is what we have during planning. In practice the estimated filtering >>>>> rate isn’t usually accurate. In fact, the estimated filtering rate can >>>>> differ from the theoretical filtering rate by as much as 17% in our >>>>> experiments. One way to mitigate the power loss of bloom filter caused >>>>> by inaccurate estimated filtering rate is to adaptively turn it off at >>>>> execution time, this is yet to be implemented. >>>>> >>>>> Here is a list of tasks we plan to work on in order to improve this >>>>> patch: >>>>> 1. More regression testing to guarantee correctness. >>>>> 2. More performance testing involving larger tables and complicated >>>>> query plans. >>>>> 3. Improve the cost model. >>>>> 4. Explore runtime tuning such as making the bloom filter checking >>>>> adaptive. >>>>> 5. Currently, only the best single join key is used for building the >>>>> Bloom filter. However, if there are several keys and we know that >>>>> their distributions are somewhat disjoint, we could leverage this fact >>>>> and use multiple keys for the bloom filter. >>>>> 6. Currently, Bloom filter pushdown is only implemented for SeqScan >>>>> nodes. However, it would be possible to allow push down to other types >>>>> of scan nodes. >>>>> 7. Explore if the Bloom filter could be pushed down through a foreign >>>>> scan when the foreign server is capable of handling it – which could >>>>> be made true for postgres_fdw. >>>>> 8. Better explain command on the usage of bloom filters. >>>>> >>>>> This patch set is prepared by Marcus Ma, Lyu Pan and myself. Feedback >>>>> is appreciated. >>>>> >>>>> With Regards, >>>>> Zheng Li >>>>> Amazon RDS/Aurora for PostgreSQL >>>>> >>>>> [1] >>>>> https://www.postgresql.org/message-id/flat/c902844d-837f-5f63-ced3-9f7fd222f175%402ndquadrant.com >>>> >>>> >>>> Hi, >>>> In the header of patch 1: >>>> >>>> In this prototype, the cost model is based on an assumption that there >>>> is a linear relationship between the performance gain from using a semijoin >>>> filter and the estimated filtering rate: >>>> % improvement to Merge Join cost = 0.83 * estimated filtering rate - >>>> 0.137. >>>> >>>> How were the coefficients (0.83 and 0.137) determined ? >>>> I guess they were based on the results of running certain workload. >>>> >>>> Cheers >>>> >>> Hi, >>> For patch 1: >>> >>> +bool enable_mergejoin_semijoin_filter; >>> +bool force_mergejoin_semijoin_filter; >>> >>> How would (enable_mergejoin_semijoin_filter = off, >>> force_mergejoin_semijoin_filter = on) be interpreted ? >>> Have you considered using one GUC which has three values: off, enabled, >>> forced ? >>> >>> + mergeclauses_for_sjf = >>> get_actual_clauses(path->path_mergeclauses); >>> + mergeclauses_for_sjf = >>> get_switched_clauses(path->path_mergeclauses, >>> + >>> path->jpath.outerjoinpath->parent->relids); >>> >>> mergeclauses_for_sjf is assigned twice and I don't >>> see mergeclauses_for_sjf being reference in the call >>> to get_switched_clauses(). >>> Is this intentional ? >>> >>> + /* want at least 1000 rows_filtered to avoid any nasty edge >>> cases */ >>> + if (force_mergejoin_semijoin_filter || (filteringRate >= >>> 0.35 && rows_filtered > 1000)) >>> >>> The above condition is narrower compared to the enclosing condition. >>> Since there is no else block for the second if block, please merge the >>> two if statements. >>> >>> + int best_filter_clause; >>> >>> Normally I would think `clause` is represented by List*. But >>> best_filter_clause is an int. Please use another variable name so that >>> there is less chance of confusion. >>> >>> For evaluate_semijoin_filtering_rate(): >>> >>> + double best_sj_selectivity = 1.01; >>> >>> How was 1.01 determined ? >>> >>> + debug_sj1("SJPD: start evaluate_semijoin_filtering_rate"); >>> >>> There are debug statements in the methods. >>> It would be better to remove them in the next patch set. >>> >>> Cheers >>> >> Hi, >> Still patch 1. >> >> + if (!outer_arg_md->is_or_maps_to_base_column >> + && !inner_arg_md->is_or_maps_to_constant) >> + { >> + debug_sj2("SJPD: outer equijoin arg does not map %s", >> + "to a base column nor a constant; semijoin is not >> valid"); >> >> Looks like there is a typo: inner_arg_md->is_or_maps_to_constant should >> be outer_arg_md->is_or_maps_to_constant >> >> + if (outer_arg_md->est_col_width > MAX_SEMIJOIN_SINGLE_KEY_WIDTH) >> + { >> + debug_sj2("SJPD: outer equijoin column's width %s", >> + "was excessive; condition rejected"); >> >> How is the value of MAX_SEMIJOIN_SINGLE_KEY_WIDTH determined ? >> >> For verify_valid_pushdown(): >> >> + Assert(path); >> + Assert(target_var_no > 0); >> + >> + if (path == NULL) >> + { >> + return false; >> >> I don't understand the first assertion. Does it mean path would always be >> non-NULL ? Then the if statement should be dropped. >> >> + if (path->parent->relid == target_var_no) >> + { >> + /* >> + * Found source of target var! We know that the >> pushdown >> + * is valid now. >> + */ >> + return true; >> + } >> + return false; >> >> The above can be simplified as: return path->parent->relid == >> target_var_no; >> >> + * True if the given con_exprs, ref_exprs and operators will exactlty >> >> Typo: exactlty -> exactly >> >> + if (!bms_equal(all_vars, matched_vars)) >> + return false; >> + return true; >> >> The above can be simplified as: return bms_equal(all_vars, matched_vars); >> >> Cheers >> > Hi, > Still in patch 1 :-) > > + if (best_path->use_semijoinfilter) > + { > + if (best_path->best_mergeclause != -1) > > Since there is no else block, the two conditions can be combined. > > + ListCell *clause_cell = list_nth_cell(mergeclauses, > best_path->best_mergeclause); > > As shown in the above code, best_mergeclause is the position of the best > merge clause in mergeclauses. > I think best_mergeclause_pos (or similar name) is more appropriate for the > fieldname. > > For depth_of_semijoin_target(): > > + * Parameters: > + * node: plan node to be considered for semijoin push down. > > The name of the parameter is pn - please align the comment with code. > > For T_SubqueryScan case in depth_of_semijoin_target(): > > + Assert(rte->subquery->targetList); > ... > + if (rel && rel->subroot > + && rte && rte->subquery && rte->subquery->targetList) > > It seems the condition can be simplified since rte->subquery->targetList > has passed the assertion. > > For is_table_scan_node_source_of_relids_or_var(), the else block can be > simplified to returning scan_node_varno == target_var->varno directly. > > For get_appendrel_occluded_references(): > > + * Given a virtual column from an Union ALL subquery, > + * return the expression it immediately occludes that satisfy > > Since the index is returned from the func, it would be better to clarify > the comment by saying `return the last index of expression ...` > > + /* Subquery without append and partitioned tables */ > > append and partitioned tables -> append or partitioned tables > > More reviews for subsequent patches to follow. > Hi, For 0002-Support-semijoin-filter-in-the-executor-for-non-para.patch ,
+ if (!qual && !projInfo && !IsA(node, SeqScanState) && + !((SeqScanState *) node)->applySemiJoinFilter) I am confused by the last two clauses in the condition. If !IsA(node, SeqScanState) is true, why does the last clause cast node to SeqScanState * ? I think you forgot to put the last two clauses in a pair of parentheses. + /* slot did not pass SemiJoinFilter, so skipping it. */ skipping it -> skip it + /* double row estimate to reduce error rate for Bloom filter */ + *nodeRows = Max(*nodeRows, scan->ss.ps.plan->plan_rows * 2); Probably add more comment above about why the row count is doubled and how the error rate is reduced. +SemiJoinFilterExamineSlot(List *semiJoinFilters, TupleTableSlot *slot, Oid tableId) SemiJoinFilterExamineSlot -> ExamineSlotUsingSemiJoinFilter Cheers