Re: [HACKERS] Push down more full joins in postgres_fdw

Etsuro Fujita Mon, 24 Oct 2016 02:05:46 -0700

On 2016/10/22 17:19, Ashutosh Bapat wrote:

Review for postgres-fdw-more-full-join-pushdown-v6 patch.


The patch applies cleanly and regression is clean (make check in
regress directory and that in postgres_fdw).


Thanks for the review!

Here are some comments.
1. Because of the following code change, for a joinrel we might end up using
attrs_used, which will be garbage for a join rel. The assumption that tlist can
not be NIL for a join relation is wrong. Even for a join relation, tlist can be
NULL.  Consider query 'explain verbose select count(*) from ft1, ft2' Since
count(*) doesn't need any columns, the tlist is NIL. With the patch the server
crashes for this query.
-    if (foreignrel->reloptkind == RELOPT_JOINREL)
+    /*
+     * Note: tlist for a base relation might be non-NIL.  For example, if the
+     * base relation is an operand of a foreign join performing a full outer
+     * join and has non-NIL remote_conds, the base relation will be deparsed
+     * as a subquery, so the tlist for the base relation could be non-NIL.
+     */
+    if (tlist != NIL)
     {
-        /* For a join relation use the input tlist */

We can not decide whether to use deparseExplicitTargetList or
deparse it from attrs_used based on the tlist. SELECT lists, whether it's
topmost SELECT or a subquery (even on a base relation), for deparsing a
JOIN query need to use deparseExplicitTargetList.


Will fix.

2. The code in deparseRangeTblRef() dealing with subqueries, is very similar to
deparseSelectStmtForRel(). The only thing deparseRangeTblRef() does not require
is the appending ORDER BY, which is determined by existence of pathkeys
argument. Probably we should reuse deparseSelectStmtForRel(), instead of
duplicating the code. This will also make the current changes to
deparseSelectSql unnecessary.


Will do.

3. Why do we need following change? The elog() just few lines above says that
we expect only Var nodes. Why then we use deparseExpr() instead of
deparseVar()?
         if (i > 0)
             appendStringInfoString(buf, ", ");
-        deparseVar(var, context);
+        deparseExpr((Expr *) var, context);

         *retrieved_attrs = lappend_int(*retrieved_attrs, i + 1);

And I get the answer for that question, somewhere down in the patch
+    /*
+     * If the given expression is an output column of a subquery-in-FROM,
+     * deparse the alias to the expression instead.
+     */
+    if (IsA(node, Var))
+    {
+        int            tabno;
+        int            colno;
+
+        if (isSubqueryExpr(node, context->foreignrel, &tabno, &colno))
+        {
+            appendStringInfo(context->buf, "%s%d.%s%d",
+                             SS_TAB_ALIAS_PREFIX, tabno,
+                             SS_COL_ALIAS_PREFIX, colno);
+            return;
+        }
+    }
+

Functionally, this code belongs to deparseVar() and not deparseExpr(). Like all
other functions which handle Expr nodes, deparseExpr() is expected to be a
dispatcher to the node specific function.


OK, will change that way.

4. This comment is good to have there, but is unrelated to this patch. May be a
separate patch?
+    /* Don't generate bad syntax if no columns */

5. Changed comment doesn't say anything different from the original comment. It
may have been good to have it this way to start with, but changing it now
doesn't bring anything new. You seem to have just merged prologue of the
function with a comment in that function. I think, this change is unnecessary.
- * The function constructs ... JOIN ... ON ... for join relation. For a base
- * relation it just returns schema-qualified tablename, with the appropriate
- * alias if so requested.
+ * For a join relation the clause of the following form is appended to buf:
+ * ((outer relation) <join type> (inner relation) ON (joinclauses))
+ * For a base relation the function just adds the schema-qualified tablename,
+ * with the appropriate alias if so requested.
+    /* Don't generate bad syntax if no columns */

6. Change may be good but unrelated to this patch. May be material for a
separate patch. There are few such changes in this function. While these
changes may be good by themselves, they distract reviewer from the goal of the
patch.
-            appendStringInfo(buf, "(");
+            appendStringInfoChar(buf, '(');

OK on #4, #5, and #6, Will remove all the changes. I will leave thosefor separate patches.

7. I don't understand why you need to change this function so much. Take for
example the following change.
-        RelOptInfo *rel_o = fpinfo->outerrel;
-        RelOptInfo *rel_i = fpinfo->innerrel;
-        StringInfoData join_sql_o;
-        StringInfoData join_sql_i;
+        /* Begin the FROM clause entry */
+        appendStringInfoChar(buf, '(');

         /* Deparse outer relation */
-        initStringInfo(&join_sql_o);
-        deparseFromExprForRel(&join_sql_o, root, rel_o, true, params_list);
+        deparseRangeTblRef(buf, root,
+                           fpinfo->outerrel,
+                           fpinfo->make_outerrel_subquery,
+                           params_list);
         /* Deparse outer relation */
-        initStringInfo(&join_sql_o);
-        deparseFromExprForRel(&join_sql_o, root, rel_o, true, params_list);
+        deparseRangeTblRef(buf, root,
+                           fpinfo->outerrel,
+                           fpinfo->make_outerrel_subquery,
+                           params_list);

It removes few variables, instead directly accesses the members from fpinfo.
Also, deparses the individual relations in the provided buffer directly, rather
than in separate buffers in the original code. Instead of this, all you had to
do was replace a call
-        deparseFromExprForRel(&join_sql_o, root, rel_o, true, params_list);
with
+        deparseRangeTblRef(&join_sql_o, root, rel_o,
fpinfo->make_outerrel_subquery, params_list)
Similarly for the inner relation. Again, the changes you have done might have
been good, if done in the original code, but doing those in this patch just
creates distractions and increases the size of the patch.

I did that for efficiency because deparsing the relation directly intothe given buffer would save cycles, but I agree that that is anotherissue. Will remove that change, and leave that for another patch.

8. Why have you changed the name of variable here?
-        if (use_alias)
+        if (add_rel_alias)


Sorry, I forgot to remove this change from the patch.  Will fix.

9. In deparseRangeTblRef(), the targetlist for the subquery is obtained by
calling build_tlist_to_deparse(), but appendSubselectAlias() constructs the
column aliases based on foreignrel->reltarget->exprs. Those two are usually
same, but there is no code which guarantees that. Instead, pass tlist to
appendSubselectAlias() or simply pass the number of entries in that list
(list_length(tlist), which is all it needs. OR use foreignrel->reltarget->exprs
directly instead of calling build_tlist_to_deparse(). Similar problem exists
with getSubselectAliasInfo().

Actually, the patch guarantees that since in that case (1)foreignrel->reltarget->exprs doesn't contain any PHVs (note that passingthe following check in foreign_join_ok implies thatforeignrel->reltarget->exprs of the input rels for a given foreign joindoesn't contain any PHVs and (2) we have fpinfo->local_conds == NIL, sobuild_tlist_to_deparse() would produce a tlist equivalent to theforeignrel->reltarget->exprs. But as you proposed, to make the codeeasier to understand, I will change to use foreignrel->reltarget->exprsdirectly.

/*

* deparseExplicitTargetList() isn't smart enough to handleanything other* than a Var. In particular, if there's some PlaceHolderVar thatwould

     * need to be evaluated within this join tree (because there's an upper
     * reference to a quantity that may go to NULL as a result of an outer

* join), then we can't try to push the join down because we'llfail when* we get to deparseExplicitTargetList(). However, aPlaceHolderVar that* needs to be evaluated *at the top* of this join tree is OK,because we

     * can do that locally after fetching the results from the remote side.
     */
    relids = joinrel->relids;
    foreach(lc, root->placeholder_list)
    {
        PlaceHolderInfo *phinfo = lfirst(lc);

        if (bms_is_subset(phinfo->ph_eval_at, relids) &&
            bms_nonempty_difference(relids, phinfo->ph_eval_at))
            return false;
    }

10. Deparsing a query in the form of a subquery makes it hard-to-read. The
original code aimed at avoiding a subquery. Also, this change has created many
expected output changes, which again seem unnecessary. In fact, having the
pushable join clauses of an inner join in ON clause, which is closer to JOIN
clause, is better than having them farther in the WHERE clause.
-    /*
-     * For an inner join, all restrictions can be treated alike. Treating the
-     * pushed down conditions as join conditions allows a top level full outer
-     * join to be deparsed without requiring subqueries.
-     */
-    if (jointype == JOIN_INNER)
-    {
-        Assert(!fpinfo->joinclauses);
-        fpinfo->joinclauses = fpinfo->remote_conds;
-        fpinfo->remote_conds = NIL;
-    }

I added this change for another patch that I proposed for extending theDML pushdown in 9.6 so that it can perform an update/delete on a joinremotely. To create a remote query for that, I think we need to pull upinner join conditions mentioning the target relation in the JOIN/ONclauses into the WHERE clause. But I have to admit that's unrelated tothis patch, so I will leave that as-is.

11. I have reworded following comment and restructured the code that follows it
in the attached patch.
+    /*
+     * Set the subquery information.  If the relation performs a full outer
+     * join and if the input relations have non-NIL remote_conds, the input
+     * relations need to be deparsed as a subquery.
+     */
The code esp. the if .. else .. block followed by another one, made it a bit
unreadable. Hence I restructured it so that it's readable and doesn't require
variable "relids" from earlier code, which was being reused. Let me know if
this change looks good.


That's a good idea.  Thanks!

12. The code below deletes the condition, which is understandable.
-            if (fpinfo_i->remote_conds || fpinfo_o->remote_conds)
But why does it add a new unrelated condition here? What the comment claims,
looks like an existing bug, unrelated to the patch. Can you please give an
example? If that it's an existing bug, it should be fixed as a separate patch.
I don't understand the relation of this code with what the patch is doing.
+            /*
+             * We can't do anything here, and if there are any non-Vars in the
+             * outerrel/innerrel's reltarget, give up pushing down this join,
+             * because the deparsing logic can't support such a case currently.
+             */
+            if (reltarget_has_non_vars(outerrel))
+                return false;
+            if (reltarget_has_non_vars(innerrel))
                 return false;

I thought the current deparsing logic wouldn't work well if the targetlist of a given subquery contained eg, system columns other than ctidand oid, but I noticed that I was wrong; it can, so we don't need thisrestriction. Will remove. (I don't think there is any bug.)

13. The comment below is missing the main purpose i.e. creating a a unique
alias, in case the relation gets converted into a subquery. Lowest or highest
relid will create a unique alias at given level of join and that would be more
future proof. If we decide to consider paths for every join order, following
comment will no more be true.
+    /*
+     * Set the relation index.  This is defined as the position of this
+     * joinrel in the join_rel_list list plus the length of the rtable list.
+     * Note that since this joinrel is at the end of the list when we are
+     * called, we can get the position by list_length.
+     */
+    fpinfo->relation_index =
+        list_length(root->parse->rtable) + list_length(root->join_rel_list);

I don't agree on that point. As I said before, the approach using thelowest/highest relid would make a remote query having nested subqueriesdifficult to read. And even if we decided to consider paths for everyjoin order, we could define the relation_index safely, by modifyingpostgresGetForeignJoinPaths so that it (1) sets the relation_index theproposed way at the first time it is called and (2) avoids setting therelation_index after the first call, by checking to seefpinfo->relation_index > 0.

14. The function name talks about non-vars but the Assert few lines below
asserts that every node there is a Var. Need better naming.
+reltarget_has_non_vars(RelOptInfo *foreignrel)
+{
+    ListCell   *lc;
+
+    foreach(lc, foreignrel->reltarget->exprs)
+    {
+        Var           *var = (Var *) lfirst(lc);
+
+        Assert(IsA(var, Var));
And also an explanation for the claim
+ * Note: currently deparseExplicitTargetList can't properly handle such Vars.


Will remove this function for the reason described in #12.

15. While deparsing every Var, we are descending the RelOptInfo hierarchy.

Right. I think that not only avoids creating redundant data to find thealias of a given Var node but also would be able to find it in optimal cost.

Instead of that, we should probably gather all the alias information once and
store it in context as an array indexed by relid. While deparsing a Var we can
get the targetlist and alias for a given relation by using var->varno as index.
And then search for given Var node in the targetlist to deparse the column name
by its position in the targetlist. For the relations that are not converted
into subqueries, this array will not hold any information and the Var will be
deparsed as it's being done today.

Sorry, I don't understand this part. How does that work when creating aremote query having nested subqueries? Is that able to search for agiven Var node efficiently?

Testing
-------
1. The code is changing deparser but doesn't have testcases
testing impact of the code. For example, we should have testcases with remote
query deparsed as nested subqueries or nested subqueries within subqueries and
so on May be testcases where a relation deeper in the RelOptInfo hierarchy has
conditions but it's immediate upper relation does not have those.


Will do.

2. The only testcase added by this patch, copies an existing case and adds a
whole row reference to one of the relations being joined. Instead we could add
that whole-row reference to the existing testcase itself.


Will do.

Best regards,
Etsuro Fujita




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Push down more full joins in postgres_fdw

Reply via email to