Apologies for submitting new patches while we're still in the midst of CommitFest:November, but I think I've more or less come to the end of the reviewing that I can usefully do for 8.4 (or at least, I haven't had any requests lately, feel free...) and I don't want to wait forever to start thinking about 8.5.
This patch is based on Simon Riggs' earlier work on join removal, although I don't believe I've actually used any of his code. Tom Lane's review comments on Simon's code were invaluable in getting this to work. I also understand that Simon plans to do more work in this area, so if this ends up getting sucked into his work or visca versa (or this ends up getting superseded altogether), that is fine. http://archives.postgresql.org/pgsql-patches/2008-08/msg00035.php http://archives.postgresql.org/pgsql-patches/2008-09/msg00024.php The theoretical underpinning of this patch is as follows: Joins can affect the result of a query in one of two ways: either they change the set of rows returned (by either increasing it or decreasing it), or they provide attributes. To remove a join, we need to prove that neither of these things is possible. It's pretty easy to test that the join isn't providing any attributes above the level of the join, and you'll see that rel_attrs_needed_above_join() handles that here. Verifying that the number of rows returned isn't changed by the join is harder. For the sake of simplicity, this patch only attempts to handle LEFT JOINs. Specifically, it attempts to prove that the results of scanning the outer side of the joinrel will be identical to the results of performing a merge-join, up to sorting. Nothing that happens on the inner side of the join can reduce the number of output rows (except via a WHERE-clause qual that will fail the rel_attrs_needed_above_join test anyway), so the only thing we need to worry about is the possibility that the join might increase the output row count. To do that, we need to know that there will be a sufficiently strong unique-ness property on the rows emitted by the inner side of the join. I believe that there are two main cases in which we could potentially prove this: the inner side of the join is an index scan (without resort) on a unique index, or the inner side of the join is a subselect with a group by clause over the join columns. For now, I'm only attempting to handle the first case, though it might be nice to eventually handle the other as well. Simon's original patch starts by checking whether the RelOptInfo for the relation to be dropped is a base relation, but that approach seems somewhat limiting, because we certainly want to be able to remove multiple joins in succession. Setting the paths for the joinrel {B C} to the paths for B doesn't help us when we try to form {A B C}, because now the join of {A} to {B C} is not a join against a base rel and join removal fails. The other problem is that at that level we don't really understand what's up with the join clauses: is the join operator even an equality operator? There is finally the problem that while we can get at the relevant IndexOptInfo structures and figure out which combinations of attributes have unique indices, I can't figure out any sane way to relate those attribute numbers back to the join clauses. So I've taken the alternative approach of working on the level of paths. sort_outer_and_inner(), before forming a join path, checks whether there happens to be a unique index with the same pathkeys as the inner side of the path. If so, and if no attributes are needed from the inner side of the join, the it pulls up all the paths from the outerrel into the joinrel and exits (really, it should skip the rest of add_paths_to_joinrel() as well, but it didn't seem worth refactoring that without some validation of the basic approach). This approach is limiting because sort_outer_and_inner() doesn't try every possible combination of pathkeys that might be helpful to us, but it seems likely to be adequate here for the same reasons that sort_outer_and_inner() gets away with it generally: the list of mergeclauses tends to be real short. (I think that if we were trying to optimize away an INNER join, this approach would be inadequate, because any non-merge-joinable join clauses could change the number of output rows. Here that can't happen: the most they can do is force the inner side to NULL, but if the inner side isn't being used that doesn't matter anyway. Of course, for INNER JOINs, we'd also need to worry about the merge-joinable clauses themselves stripping rows from the output due to a failure to match; we'd need to check for foreign key constraints.) I have a feeling there are probably lingering bugs in here, but it passes regression tests and some sample queries I tried still return the correct answers even after nuking one or more joins. Any comments appreciated, ...Robert
*** a/src/backend/optimizer/path/costsize.c --- b/src/backend/optimizer/path/costsize.c *************** *** 108,113 **** bool enable_hashagg = true; --- 108,114 ---- bool enable_nestloop = true; bool enable_mergejoin = true; bool enable_hashjoin = true; + bool enable_joinremoval = true; typedef struct { *** a/src/backend/optimizer/path/joinpath.c --- b/src/backend/optimizer/path/joinpath.c *************** *** 41,46 **** static List *select_mergejoin_clauses(PlannerInfo *root, --- 41,49 ---- RelOptInfo *innerrel, List *restrictlist, JoinType jointype); + static bool rel_attrs_needed_above_join(RelOptInfo *rel, RelOptInfo *joinrel); + static bool test_join_removal(RelOptInfo *joinrel, RelOptInfo *rel, + List *inner_pathkeys_for_merge, bool *removejoinOK); /* *************** *** 158,163 **** sort_inner_and_outer(PlannerInfo *root, --- 161,167 ---- SpecialJoinInfo *sjinfo) { bool useallclauses; + bool removejoinOK = false; Path *outer_path; Path *inner_path; List *all_pathkeys; *************** *** 169,176 **** sort_inner_and_outer(PlannerInfo *root, */ switch (jointype) { - case JOIN_INNER: case JOIN_LEFT: case JOIN_SEMI: case JOIN_ANTI: case JOIN_UNIQUE_OUTER: --- 173,181 ---- */ switch (jointype) { case JOIN_LEFT: + removejoinOK = enable_joinremoval; + case JOIN_INNER: case JOIN_SEMI: case JOIN_ANTI: case JOIN_UNIQUE_OUTER: *************** *** 276,281 **** sort_inner_and_outer(PlannerInfo *root, --- 281,295 ---- cur_mergeclauses, outerkeys); + /* Join removal. */ + if (removejoinOK + && test_join_removal(joinrel, innerrel, innerkeys, &removejoinOK)) + { + add_paths_from_list(joinrel, outerrel->pathlist); + /* No point in considering any more join paths. */ + return; + } + /* Build pathkeys representing output sort order */ merge_pathkeys = build_join_pathkeys(root, joinrel, jointype, outerkeys); *************** *** 1048,1050 **** select_mergejoin_clauses(PlannerInfo *root, --- 1062,1196 ---- return result_list; } + + /* + * test_join_removal + * Attempt to prove that joining against the inner relation can't change + * the number of output rows. + */ + static bool + test_join_removal(RelOptInfo *joinrel, RelOptInfo *innerrel, + List *inner_pathkeys_for_merge, bool *removejoinOK) + { + bool any_hope_left = false; + bool useful_index_found = false; + ListCell *l; + + /* + * We're basically looking for an index-scan path over a unique index on a + * suitable set of columns. In the future, we might want to consider the + * case of a LEFT JOIN against a grouped subselect where the grouped + * columns are mergejoinable: + * SELECT c.* FROM customer c + * LEFT JOIN (SELECT customer_id, SUM(1) FROM customer_order + * GROUP BY 1) co ON c.id = co.customer_id + */ + foreach(l, innerrel->pathlist) + { + Path *path = lfirst(l); + /* + * Since we're only handling LEFT joins for now, we don't need to + * worry about whether the inner side of the join reduces the number + * of output rows. In general, that can only happen if there is a + * qual on the inner rel - but the rel_attrs_needed_above_join test + * will catch that case. + * + * We do need to worry about the join increasing the number of output + * rows. It suffices to find a unique index on all of the pathkeys for + * ANY possible way of doing the merge join. In other words, the + * presence of non-merge-joinable join clauses isn't a problem - and + * we don't need to use all of the clauses, either. In such cases, + * the unique index we're looking for will be stronger than necessary, + * but that doesn't matter (since this is an outer join). + * + * This algorithm can miss a legal join removal if the set of pathkeys + * that corresponds to the ordering of the relevant index is never + * tried, but it doesn't seem worth spending too much time worrying + * about that case for the same reasons mentioned in + * sort_outer_and_inner. + */ + if (IsA(path, IndexPath)) { + IndexOptInfo *info = ((IndexPath *) path)->indexinfo; + if (!info->unique) + continue; + /* + * What's going on here is a little bit subtle. The only possible + * source of pathkeys for the index path is the columns of the + * index, so we know that the pathkeys are derived from the index + * columns. However, we don't know that they're all present in + * the index-path's list of pathkeys, because the planner is free + * to truncate the pathkey list if the lower-order terms aren't + * useful to the current query. So we must check that the lists + * are of equal length. (If they aren't, the unique index isn't + * "unique enough", because it has lower-order terms.) + */ + if (info->ncolumns == list_length(path->pathkeys) + && compare_pathkeys(inner_pathkeys_for_merge, path->pathkeys) + == PATHKEYS_EQUAL) + { + useful_index_found = true; + break; + } + /* + * Since we found a unique index, there's still hope, even if this + * set of pathkeys doesn't pan out. + */ + any_hope_left = true; + } + /* + * In the future, we might want to consider other ways to prove that + * the left join can't increase the row count, such as a LEFT JOIN + * against a merge-joinable grouped subselect, like this: + * SELECT + * c.* + * FROM + * customer c + * LEFT JOIN (SELECT customer_id, SUM(1) FROM customer_order + * GROUP BY 1) co + * ON c.id = co.customer_id + */ + } + + /* + * We found a useful unique index. But we still can't remove the join + * unless the rel supplies no attrs. + */ + if (useful_index_found) + { + if (rel_attrs_needed_above_join(innerrel, joinrel)) + { + /* + * If the innerrel is supplying attributes, join removal fails, + * and there's no point in rescanning with a different set of + * pathkeys, so kill off any further attempts. + */ + *removejoinOK = false; + return false; + } + return true; + } + + /* No luck. */ + if (!any_hope_left) + *removejoinOK = false; + return false; + } + + /* + * rel_attrs_needed_above_join + * Determine whether any attributes from rel are used outside of joinrel. + * + * As a micro-optimization, it seems better to start with max_attr and count + * down rather than starting with min_attr and counting up, on the theory that + * the system attributes are somewhat less likely to be what is wanted and + * should be tested last. + */ + static bool + rel_attrs_needed_above_join(RelOptInfo *rel, RelOptInfo *joinrel) + { + int attroff; + for (attroff = rel->max_attr - rel->min_attr; attroff >= 0; --attroff) + if (!bms_is_subset(rel->attr_needed[attroff], joinrel->relids)) + return true; + return false; + } *** a/src/backend/optimizer/util/pathnode.c --- b/src/backend/optimizer/util/pathnode.c *************** *** 340,346 **** add_path(RelOptInfo *parent_rel, Path *new_path) /* * Delete the data pointed-to by the deleted cell, if possible */ ! if (!IsA(old_path, IndexPath)) pfree(old_path); /* Advance list pointer */ if (p1_prev) --- 340,346 ---- /* * Delete the data pointed-to by the deleted cell, if possible */ ! if (!IsA(old_path, IndexPath) && old_path->parent == parent_rel) pfree(old_path); /* Advance list pointer */ if (p1_prev) *************** *** 378,388 **** add_path(RelOptInfo *parent_rel, Path *new_path) else { /* Reject and recycle the new path */ ! if (!IsA(new_path, IndexPath)) pfree(new_path); } } /***************************************************************************** * PATH NODE CREATION ROUTINES --- 378,400 ---- else { /* Reject and recycle the new path */ ! if (!IsA(new_path, IndexPath) && new_path->parent == parent_rel) pfree(new_path); } } + /* + * add_paths_from_list + * Add each path from the list to parent_rel. + */ + void + add_paths_from_list(RelOptInfo *parent_rel, List *pathlist) + { + ListCell *lc; + + foreach (lc, pathlist) + add_path(parent_rel, (Path *) lfirst(lc)); + } /***************************************************************************** * PATH NODE CREATION ROUTINES *** a/src/backend/utils/misc/guc.c --- b/src/backend/utils/misc/guc.c *************** *** 648,653 **** static struct config_bool ConfigureNamesBool[] = --- 648,661 ---- true, NULL, NULL }, { + {"enable_joinremoval", PGC_USERSET, QUERY_TUNING_METHOD, + gettext_noop("Enables the planner's use of join removal."), + NULL + }, + &enable_joinremoval, + true, NULL, NULL + }, + { {"enable_hashjoin", PGC_USERSET, QUERY_TUNING_METHOD, gettext_noop("Enables the planner's use of hash join plans."), NULL *** a/src/backend/utils/misc/postgresql.conf.sample --- b/src/backend/utils/misc/postgresql.conf.sample *************** *** 190,195 **** --- 190,196 ---- #enable_seqscan = on #enable_sort = on #enable_tidscan = on + #enable_joinremoval = on # - Planner Cost Constants - *** a/src/include/optimizer/cost.h --- b/src/include/optimizer/cost.h *************** *** 59,64 **** extern bool enable_hashagg; --- 59,65 ---- extern bool enable_nestloop; extern bool enable_mergejoin; extern bool enable_hashjoin; + extern bool enable_joinremoval; extern int constraint_exclusion; extern double clamp_row_est(double nrows); *** a/src/include/optimizer/pathnode.h --- b/src/include/optimizer/pathnode.h *************** *** 26,31 **** extern int compare_fractional_path_costs(Path *path1, Path *path2, --- 26,32 ---- double fraction); extern void set_cheapest(RelOptInfo *parent_rel); extern void add_path(RelOptInfo *parent_rel, Path *new_path); + extern void add_paths_from_list(RelOptInfo *parent_rel, List *pathlist); extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel); extern IndexPath *create_index_path(PlannerInfo *root, *** a/src/test/regress/expected/rangefuncs.out --- b/src/test/regress/expected/rangefuncs.out *************** *** 1,16 **** SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%'; ! name | setting ! -------------------+--------- ! enable_bitmapscan | on ! enable_hashagg | on ! enable_hashjoin | on ! enable_indexscan | on ! enable_mergejoin | on ! enable_nestloop | on ! enable_seqscan | on ! enable_sort | on ! enable_tidscan | on ! (9 rows) CREATE TABLE foo2(fooid int, f2 int); INSERT INTO foo2 VALUES(1, 11); --- 1,17 ---- SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%'; ! name | setting ! --------------------+--------- ! enable_bitmapscan | on ! enable_hashagg | on ! enable_hashjoin | on ! enable_indexscan | on ! enable_joinremoval | on ! enable_mergejoin | on ! enable_nestloop | on ! enable_seqscan | on ! enable_sort | on ! enable_tidscan | on ! (10 rows) CREATE TABLE foo2(fooid int, f2 int); INSERT INTO foo2 VALUES(1, 11);
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers