On Thu, Oct 27, 2016 at 10:50 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote: > Please find attached latest patch which fix the review point as well as > additional clean-up.
I've signed up to review this patch and I'm planning to do some testing. Here's some initial feedback after a quick read-through: + if (gather_merge_readnext(gm_state, i, initialize ? false : true)) Clunky ternary operator... how about "!initialize". +/* + * Function clear out a slot in the tuple table for each gather merge + * slots and returns the clear clear slot. + */ Maybe better like this? "_Clear_ out a slot in the tuple table for each gather merge _slot_ and _return_ the _cleared_ slot." + /* Free tuple array as we no more need it */ "... as we don't need it any more" +/* + * Read the next tuple for gather merge. + * + * Function fetch the sorted tuple out of the heap. + */ "_Fetch_ the sorted tuple out of the heap." + * Otherwise, pull the next tuple from whichever participate we + * returned from last time, and reinsert the index into the heap, + * because it might now compare differently against the existing s/participate/participant/ + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California Shouldn't this say just "(c) 2016, PostgreSQL Global Development Group"? Are we supposed to be blaming the University of California for new files? +#include "executor/tqueue.h" +#include "miscadmin.h" +#include "utils/memutils.h" +#include "utils/rel.h" +#include "lib/binaryheap.h" Not correctly sorted. + /* + * store the tuple descriptor into gather merge state, so we can use it + * later while initilizing the gather merge slots. + */ s/initilizing/initializing/ +/* ---------------------------------------------------------------- + * ExecEndGatherMerge + * + * frees any storage allocated through C routines. + * ---------------------------------------------------------------- The convention in Postgres code seems to be to use a form like "Free any storage ..." in function documentation. Not sure if that's an imperative, an infinitive, or if the word "we" is omitted since English is so fuzzy like that, but it's inconsistent with other documentation to use "frees" here. Oh, I see that exact wording is in several other files. I guess I'll just leave this as a complaint about all those files then :-) + * Pull atleast single tuple from each worker + leader and set up the heap. s/atleast single/at least a single/ + * Read the tuple for given reader into nowait mode, and form the tuple array. s/ into / in / + * Function attempt to read tuple for the given reader and store it into reader s/Function attempt /Attempt / + * Function returns true if found tuple for the reader, otherwise returns s/Function returns /Return / + * First try to read tuple for each worker (including leader) into nowait + * mode, so that we initialize read from each worker as well as leader. I wonder if it would be good to standardise on the terminology we use when we mean workers AND the leader. In my Parallel Shared Hash work, I've been saying 'participants' if I mean = workers + leader. What do you think? + * After this if all active worker unable to produce the tuple, then + * re-read and this time read the tuple into wait mode. For the worker, + * which was able to produced single tuple in the earlier loop and still + * active, just try fill the tuple array if more tuples available. + */ How about this? "After this, if all active workers are unable to produce a tuple, then re-read and this time us wait mode. For workers that were able to produce a tuple in the earlier loop and are still active, just try to fill the tuple array if more tuples are available." + * The heap is never spilled to disk, since we assume N is not very large. So + * this is much simple then cost_sort. s/much simple then/much simpler than/ + /* + * Avoid log(0)... + */ + N = (path->num_workers < 2) ? 2.0 : (double) path->num_workers; + logN = LOG2(N); ... + /* Per-tuple heap maintenance cost */ + run_cost += path->path.rows * comparison_cost * 2.0 * logN; Why multiply by two? The comment above this code says "about log2(N) comparisons to delete the top heap entry and another log2(N) comparisons to insert its successor". In fact gather_merge_getnext calls binaryheap_replace_first, which replaces the top element without any comparisons at all and then performs a sift-down in log2(N) comparisons to find its new position. There is no per-tuple "delete" involved. We "replace" the top element with the value it already had, just to trigger the sift-down, because we know that our comparator function might have a new opinion of the sort order of this element. Very clever! The comment and the 2.0 factor in cost_gather_merge seem to be wrong though -- or am I misreading the code? Also, shouldn't we add 1 to N to account for the leader? Suppose there are 2 workers. There are 3 elements in the binary heap. The element to be sifted down must be compared against either 1 or 2 others to reorganise the heap. Surely in that case we should estimate log2(3) = ~1.58 comparisons, not log2(2) = 1 comparison. I suspect that the leader's contribution will be equivalent to a whole worker if the plan involves a sort: as soon as the leader pulls a tuple in gather_merge_init, the sort node will pull all the tuples it can in a tight loop. It's unfortunate that cost_seqscan has to estimate what the leader's contribution will be without knowing whether it has a "greedy" high-startup-cost consumer like a sort or hash node where the leader will contribute a whole backend's full attention as soon as it executes the plan, or a lazy consumer where the leader will probably not contribute much if there are enough workers to keep it distracted. In the case of a Gather Merge -> Sort -> Parallel Seq Scan plan, I think we will overestimate the number of rows (per participant), because cost_seqscan will guess that the leader is spending 30% of its time per worker servicing the workers, when in fact it will be sucking tuples into a sort node just as fast as anyone else. But I don't see what this patch can do about that... + * When force is true, function reads the tuple into wait mode. For gather + * merge we need to fill the slot from which we returned the earlier tuple, so + * this require tuple to be read into wait mode. During initialization phase, + * once we try to read the tuple into no-wait mode as we want to initialize all + * the readers. Refer gather_merge_init() for more details. + * + * Function returns true if found tuple for the reader, otherwise returns + * false. + */ +static bool +gather_merge_readnext(GatherMergeState *gm_state, int reader, bool force) s/into wait mode/in wait mode/ This appears throughout the comments; not sure if I can explain this well but "in wait mode" describes a state of being which is wanted here, "into wait mode" describes some kind of change or movement or insertion. Perhaps it would be better to say "reads the tuple _queue_ in wait mode", just to make clearer that this is talking about the wait/nowait feature of tuple queues, and perhaps also note that the leader always waits since it executes the plan. Maybe we should use "bool nowait" here anway, mirroring the TupleQueue interface? Why introduce another terminology for the same thing with inverted sense? +/* + * Read the tuple for given reader into nowait mode, and form the tuple array. + */ +static void +form_tuple_array(GatherMergeState *gm_state, int reader) This function is stangely named. How about try_to_fill_tuple_buffer or something? + GMReaderTuple *gm_tuple = &gm_state->gm_tuple[reader]; I wonder if the purpose of gm_tuple, would be clearer if it were called gm_tuple_buffers. Plural because it holds one buffer per reader. Then in that variable on the left hand side there could be called tuple_buffer (singular), because it's the buffer of tuples for one single reader. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers