On Mon, Mar 27, 2023 at 7:04 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> I found another problem. I realised that ... FULL JOIN ... LIMIT n
> might be able to give wrong answers with unlucky scheduling.
> Unfortunately I have been unable to reproduce the phenomenon I am
> imagining yet but I can't think of any mechanism that prevents the
> following sequence of events:
>
> P0 probes, pulls n tuples from the outer relation and then the
> executor starts to shut down (see commit 3452dc52 which pushed down
> LIMIT), but just then P1 attaches, right before P0 does. P1
> continues, and finds < n outer tuples while probing and then runs out
> so it enters the unmatched scan phase, and starts emitting bogusly
> unmatched tuples. Some outer tuples we needed to get the complete set
> of match bits and thus the right answer were buffered inside P0's
> subplan and abandoned.
>
> I've attached a simple fixup for this problem. Short version: if
> you're abandoning your PHJ_BATCH_PROBE phase without reaching the end,
> you must be shutting down, so the executor must think it's OK to
> abandon tuples this process has buffered, so it must also be OK to
> throw all unmatched tuples out the window too, as if this process was
> about to emit them.
> Right?
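To make sure I'm picturing the same shape of query, I assume we are talking about something like this (just an illustration of the shape, not a reproducer, since the bad interleaving depends on worker scheduling; r, s, and i as in your COUNT(*) example below):

  select *
  from r
  full join s using (i)
  limit 10;

i.e. a parallel full hash join underneath a LIMIT that can be satisfied before every participant has finished probing.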
I understand the scenario you are thinking of; however, I question how those incorrectly formed tuples would ever be returned by the query. The hashjoin would only start to shut down once enough tuples had been emitted to satisfy the limit, at which point those tuples buffered in P0 may be emitted by this worker but wouldn't be included in the query result, no? I suppose that even if what I said is true, we do not want the hashjoin node to ever produce incorrect tuples, in which case your fix seems correct to me.

> With all the long and abstract discussion of hard to explain problems
> in this thread and related threads, I thought I should take a step
> back and figure out a way to demonstrate what this thing really does
> visually. I wanted to show that this is a very useful feature that
> unlocks previously unobtainable parallelism, and to show the
> compromise we've had to make so far in an intuitive way. With some
> extra instrumentation hacked up locally, I produced the attached
> "progress" graphs for a very simple query: SELECT COUNT(*) FROM r FULL
> JOIN s USING (i). Imagine a time axis along the bottom, but I didn't
> bother to add numbers because I'm just trying to convey the 'shape' of
> execution with relative times and synchronisation points.
>
> Figures 1-3 show that phases 'h' (hash) and 'p' (probe) are
> parallelised and finish sooner as we add more processes to help out,
> but 's' (= the unmatched inner tuple scan) is not. Note that if all
> inner tuples are matched, 's' becomes extremely small and the
> parallelism is almost as fair as a plain old inner join, but here I've
> maximised it: all inner tuples were unmatched, because the two
> relations have no matches at all. Even if we achieve perfect linear
> scalability for the other phases, the speedup will be governed by
> https://en.wikipedia.org/wiki/Amdahl%27s_law and the only thing that
> can mitigate that is if there is more useful work those early-quitting
> processes could do somewhere else in your query plan.
>
> Figure 4 shows that it gets a lot fairer in a multi-batch join,
> because there is usually useful work to do on other batches of the
> same join. Notice how processes initially work on loading, probing
> and scanning different batches to reduce contention, but they are
> capable of ganging up to load and/or probe the same batch if there is
> nothing else left to do (for example P2 and P3 both work on p5 near
> the end). For now, they can't do that for the s phases. (BTW, the
> little gaps before loading is the allocation phase that I didn't
> bother to plot because they can't fit a label on them; this
> visualisation technique is a WIP.)
>
> With the "opportunistic" change we are discussing for later work,
> figure 4 would become completely fair (P0 and P2 would be able to join
> in and help out with s6 and s7), but single-batch figures 1-3 would
> not (that would require a different executor design AFAICT, or a
> eureka insight we haven't had yet; see long-winded discussion).

Cool diagrams!

> The last things I'm thinking about now: Are the planner changes
> right?

I think the current changes are correct. I wonder if we have to change anything in initial/final_cost_hashjoin to account for the fact that, for a single-batch full/right parallel hash join, part of the execution is serial. And, if so, do we need to consider the estimated number of unmatched tuples to be emitted?
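Just to put a rough number on that serial part (numbers entirely made up; the real fraction depends on how many unmatched inner tuples there are): if the serial 's' phase were, say, 20% of the total work, Amdahl's law caps the speedup at 5x no matter how many workers we add:

  select n as workers,
         round(1 / (0.2 + 0.8 / n), 2) as max_speedup
  from generate_series(1, 8) as n;

That is roughly the intuition behind wondering whether the costing should treat that phase as serial rather than dividing it among the workers.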
> Are the tests enough?

So, the tests currently in the patch set cover the unmatched tuple scan phase for single-batch parallel full hash join. I've attached the dumbest possible addition to that: a multi-batch full parallel hash join case. I did not do any checking to ensure I picked the case which would add the least execution time to the test, etc. Of course, this does leave the skip_unmatched code you added uncovered, but I think if we had the testing infrastructure to test that, we would be on a beach somewhere reading a book instead of beating our heads against the wall trying to determine if there are any edge cases we are missing in adding this feature.

- Melanie
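P.S. If we wanted the test to also verify that the new query really does run as a multi-batch hash join, I think we could wrap it in the existing hash_join_batches() helper in join_hash.sql, the same way the other multi-batch cases do (an untested sketch, not part of the attached patch):

  select original > 1 as multibatch
    from hash_join_batches(
  $$
    select count(*) from simple r full outer join simple s using (id);
  $$);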
From e357b1299be2ef7a2949825af54572afc3e734eb Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Thu, 30 Mar 2023 15:04:48 -0400
Subject: [PATCH v1] Add multi-batch parallel full hash join test

---
 src/test/regress/expected/join_hash.out | 7 +++++++
 src/test/regress/sql/join_hash.sql      | 2 ++
 2 files changed, 9 insertions(+)

diff --git a/src/test/regress/expected/join_hash.out b/src/test/regress/expected/join_hash.out
index 027f3888b0..09376514bb 100644
--- a/src/test/regress/expected/join_hash.out
+++ b/src/test/regress/expected/join_hash.out
@@ -304,6 +304,13 @@ $$);
 t | f
 (1 row)
 
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
 rollback to settings;
 -- The "bad" case: during execution we need to increase number of
 -- batches; in this case we plan for 1 batch, and increase at least a
diff --git a/src/test/regress/sql/join_hash.sql b/src/test/regress/sql/join_hash.sql
index ba1b3e6e1b..179e94941c 100644
--- a/src/test/regress/sql/join_hash.sql
+++ b/src/test/regress/sql/join_hash.sql
@@ -187,6 +187,8 @@ select original > 1 as initially_multibatch, final > original as increased_batch
 $$
   select count(*) from simple r join simple s using (id);
 $$);
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
 rollback to settings;
 -- The "bad" case: during execution we need to increase number of
 -- batches; in this case we plan for 1 batch, and increase at least a
-- 
2.37.2