On Wed, Oct 12, 2022 at 4:35 PM Zhihong Yu <z...@yugabyte.com> wrote:
> > > On Wed, Oct 12, 2022 at 3:36 PM Lyu Pan <lyu.steve....@gmail.com> wrote: > >> Hello Zhihong Yu & Tomas Vondra, >> >> Thank you so much for your review and feedback! >> >> We made some updates based on previous feedback and attached the new >> patch set. Due to time constraints, we didn't get to resolve all the >> comments, and we'll continue to improve this patch. >> >> > In this prototype, the cost model is based on an assumption that there >> is >> > a linear relationship between the performance gain from using a semijoin >> > filter and the estimated filtering rate: >> > % improvement to Merge Join cost = 0.83 * estimated filtering rate - >> 0.137. >> > >> > How were the coefficients (0.83 and 0.137) determined ? >> > I guess they were based on the results of running certain workload. >> >> Right, the coefficients (0.83 and 0.137) determined are based on some >> preliminary testings. The current costing model is pretty naive and >> we'll work on a more robust costing model in future work. >> >> >> > I agree, in principle, although I think the current logic / formula is a >> > bit too crude and fitted to the simple data used in the test. I think >> > this needs to be formulated as a regular costing issue, considering >> > stuff like cost of the hash functions, and so on. >> > >> > I think this needs to do two things: >> > >> > 1) estimate the cost of building the bloom filter - This shall depend on >> > the number of rows in the inner relation, number/cost of the hash >> > functions (which may be higher for some data types), etc. >> > >> > 2) estimate improvement for the probing branch - Essentially, we need to >> > estimate how much we save by filtering some of the rows, but this also >> > neeeds to include the cost of probing the bloom filter. >> > >> > This will probably require some improvements to the lib/bloomfilter, in >> > order to estimate the false positive rate - this may matter a lot for >> > large data sets and small work_mem values. The bloomfilter library >> > simply reduces the size of the bloom filter, which increases the false >> > positive rate. At some point it'll start reducing the benefit. >> > >> >> These suggestions make a lot of sense. The current costing model is >> definitely not good enough, and we plan to work on a more robust >> costing model as we continue to improve the patch. >> >> >> > OK. Could also build the bloom filter in shared memory? >> > >> >> We thought about this approach but didn't prefer this one because if >> all worker processes share the same bloom filter in shared memory, we >> need to frequently lock and unlock the bloom filter to avoid race >> conditions. So we decided to have each worker process create its own >> bloom filter. >> >> >> > IMHO we shouldn't make too many conclusions from these examples. Yes, it >> > shows merge join can be improved, but for cases where a hashjoin works >> > better so we wouldn't use merge join anyway. >> > >> > I think we should try constructing examples where either merge join wins >> > already (and gets further improved by the bloom filter), or would lose >> > to hash join and the bloom filter improves it enough to win. >> > >> > AFAICS that requires a join of two large tables - large enough that hash >> > join would need to be batched, or pre-sorted inputs (which eliminates >> > the explicit Sort, which is the main cost in most cases). >> > >> > The current patch only works with sequential scans, which eliminates the >> > second (pre-sorted) option. So let's try the first one - can we invent >> > an example with a join of two large tables where a merge join would win? >> > >> > Can we find such example in existing benchmarks like TPC-H/TPC-DS. >> > >> >> Agreed. The current examples are only intended to show us that using >> bloom filters in merge join could improve the merge join performance >> in some cases. We are working on testing more examples that merge join >> with bloom filter could out-perform hash join, which should be more >> persuasive. >> >> >> > The bloom filter is built by the first seqscan (on t0), and then used by >> > the second seqscan (on t1). But this only works because we always run >> > the t0 scan to completion (because we're feeding it into Sort) before we >> > start scanning t1. >> > >> > But when the scan on t1 switches to an index scan, it's over - we'd be >> > building the filter without being able to probe it, and when we finish >> > building it we no longer need it. So this seems pretty futile. >> > >> > It might still improve plans like >> > >> > -> Merge Join >> > Merge Cond: (t0.c1 = t1.c1) >> > SemiJoin Filter Created Based on: (t0.c1 = t1.c1) >> > SemiJoin Estimated Filtering Rate: 1.0000 >> > -> Sort >> > Sort Key: t0.c1 >> > -> Seq Scan on t0 >> > -> Index Scan on t1 >> > >> > But I don't know how common/likely that actually is. I'd expect to have >> > an index on both sides, but perhaps I'm wrong. >> > >> > This is why hashjoin seems like a more natural fit for the bloom filter, >> > BTW, because there we have a guarantee the inner relation is processed >> > first (so we know the bloom filter is fine and can be probed). >> > >> >> Great observation. The bloom filter only works if the first SeqScan >> always runs to completion before the second SeqScan starts. >> I guess one possible way to avoid futile bloom filter might be >> enforcing that the bloom filter only be used if both the outer/inner >> plans of the MergeJoin are Sort nodes, to guarantee the bloom filter >> is ready to use after processing one side of the join, but this may be >> too restrictive. >> >> >> > I don't know what improvements you have in mind exactly, but I think >> > it'd be good to show which node is building/using a bloom filter, and >> > then also some basic stats (size, number of hash functions, FPR, number >> > of probes, ...). This may require improvements to lib/bloomfilter, which >> > currently does not expose some of the details. >> > >> >> Along with the new patch set, we have added information to display >> which node is building/using a bloom filter (as well as the >> corresponding expressions), and some bloom filter basic stats. We'll >> add more related information (e.g. FPR) as we modify lib/bloomfilter >> implementation in future work. >> >> >> Thanks again for the great reviews and they are really useful! More >> feedback is always welcome and appreciated! >> >> Regards, >> Lyu Pan >> Amazon RDS/Aurora for PostgreSQL >> >> Hi, > For v1-0001-Support-semijoin-filter-in-the-planner-optimizer.patch : > > + > + /* want at least 1000 rows_filtered to avoid any nasty edge cases > */ > + if (force_mergejoin_semijoin_filter || > + (filtering_rate >= 0.35 && rows_filtered > 1000 && > best_filter_clause >= 0)) > > Currently rows_filtered is compared with a constant, should the constant > be made configurable ? > > + * improvement of 19.5%. This equation also concludes thata a > 17% > > Typo: thata > > + inner_md_array = palloc(sizeof(SemiJoinFilterExprMetadata) * num_md); > + if (!outer_md_array || !inner_md_array) > + { > + return 0; /* a stack array allocation failed */ > > Should the allocated array be freed before returning ? > > For verify_valid_pushdown(), the parameters in comment don't match the > actual parameters. Please modify the comment. > > For is_fk_pk(), since the outer_key_list is fixed across the iterations, I > think all_vars can be computed outside of expressions_match_foreign_key(). > > Cheers > Continuing with v1-0001-Support-semijoin-filter-in-the-planner-optimizer.patch For get_switched_clauses(), the returned List contains all the clauses. Yet the name suggests that only switched clauses are returned. Please rename the method to adjust_XX or rearrange_YY so that it is less likely to cause confusion. For depth_of_semijoin_target(), idx is only used inside the `if (num_exprs == get_appendrel_occluded_references(` block, you can move its declaration inside that if block. + outside_subq_rte = root->simple_rte_array[outside_subq_var->varno]; + + /* System Vars have varattno < 0, don't bother */ + if (outside_subq_var->varattno <= 0) + return 0; Since the check on outside_subq_var->varattno doesn't depend on outside_subq_rte, you can move the assignment to outside_subq_rte after the if check. Cheers