I came across a query where the plan includes some unnecessary Sort nodes. Here's an example that shows the issue.
create table t(a int, b int); insert into t select i%100, i from generate_series(1,10000)i; create index on t(a); vacuum analyze t; set enable_hashagg to off; explain (costs off) select * from t t1 where t1.a in (select a from t t2 where a < 10) order by t1.a; QUERY PLAN --------------------------------------------------------------- Merge Join Merge Cond: (t1.a = t2.a) -> Index Scan using t_a_idx on t t1 -> Sort Sort Key: t2.a -> Unique -> Sort Sort Key: t2.a -> Index Only Scan using t_a_idx on t t2 Index Cond: (a < 10) (10 rows) I believe the two Sort nodes are unnecessary. After some digging, it seems that this is related to one of the approaches we use to implement semijoins: unique-ifying the RHS and then doing a regular join. The unique-ification is handled in create_unique_path(), which considers both hash-based and sort-based implementations. However, in the case of sort-based implementation, this function pays no attention to the subpath's pathkeys or the pathkeys of the resulting output. In addition to this specific issue, it seems to me that there are other potential issues in create_unique_path(). * Currently, we only consider the cheapest_total_path of the RHS when unique-ifying it. I think this may cause us to miss some optimization opportunities. For example, there might be a path with a better sort order that isn't the cheapest-total one. Such a path could help avoid a sort at a higher level, potentially resulting in a cheaper overall plan. * In create_unique_path(), we currently rely on heuristics to decide whether to use a hash-based or sort-based method. I think a better approach would be to create paths for both methods and let add_path() determine which one is better, similar to how we handle path selection in other parts of the planner. Therefore, I'm thinking that maybe we could create a new RelOptInfo for the RHS rel to represent its unique-ified version, and then generate all worthwhile paths for it, similar to how it's done in create_distinct_paths(). Since this is likely to be called repeatedly on the same rel, we can cache the new RelOptInfo in the rel struct, just like how we cache cheapest_unique_path currently. To be concrete, I'm envisioning something like the following: if (bms_equal(sjinfo->syn_righthand, rel2->relids) && - create_unique_path(root, rel2, rel2->cheapest_total_path, - sjinfo) != NULL) + (rel2_unique = create_unique_rel(root, rel2, sjinfo)) != NULL) ... - add_paths_to_joinrel(root, joinrel, rel1, rel2, - JOIN_UNIQUE_INNER, sjinfo, + add_paths_to_joinrel(root, joinrel, rel1, rel2_unique, + JOIN_INNER, sjinfo, restrictlist); - add_paths_to_joinrel(root, joinrel, rel2, rel1, - JOIN_UNIQUE_OUTER, sjinfo, + add_paths_to_joinrel(root, joinrel, rel2_unique, rel1, + JOIN_INNER, sjinfo, restrictlist); In addition, by changing the join from "rel1" and "rel2" using JOIN_UNIQUE_OUTER or JOIN_UNIQUE_INNER to a join between "rel1" and "rel2_unique" using a standard JOIN_INNER, we might be able to get rid of the JOIN_UNIQUE_OUTER and JOIN_UNIQUE_INNER jointypes. This could simplify a lot of logic in joinpath.c, where we're increasingly adding special-case handling for these two jointypes. One last point, I doubt that the code related to UNIQUE_PATH_NOOP is reachable in practice. create_unique_path() checks whether the input rel is an RTE_RELATION or RTE_SUBQUERY and is provably unique, and creates a UNIQUE_PATH_NOOP UniquePath if so. However, in that case, the semijoin should have already been simplified to a plain inner join by analyzejoins.c. Any thoughts? Thanks Richard