There are a couple of open items for the parallel CREATE INDEX patch
that at this point represent blockers to commit, IMV. The first is
around a deficiency in the shared refcount mechanism, which is well
understood and doesn't need to be rehashed on this thread. The second
is the cost model, which is what I want to talk about here.

Currently, the cost model scales the number of workers at logarithmic
intervals, in the style of compute_parallel_worker(), but without
considering heap pages (that's actually broken out, for now). I'm
calling a new function that lives in planner.c, right next to
plan_cluster_use_sort(). ISTM that we ought to be considering both
heap pages and index pages, which makes the new signature of
compute_parallel_worker() (which now anticipates the needs of parallel
index scans) interesting to me, so that's something that I need to
look into.
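
To make the shape of that scaling concrete, here is a minimal sketch
of the general pattern that compute_parallel_worker() follows. The
function name and details are invented for illustration; this is not
code from the patch:

/*
 * Illustrative sketch only (not from the patch): each additional
 * worker requires three times as many pages as the last one did,
 * which is the general shape of compute_parallel_worker()'s
 * scaling.  Assumes threshold >= 1.
 */
static int
scale_workers_logarithmically(double pages, double threshold)
{
	int			workers = 0;

	while (pages >= threshold)
	{
		workers++;
		threshold *= 3.0;
	}

	return workers;
}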

Right now, as of V8, we:

0. See if the parallel_workers *index* storage parameter is set (I've
added this new storage parameter, but I think that Amit has or will
add the same new index storage parameter for parallel index scan [1]).

1. Estimate/project the size of the finished index, using a new
nbtpage.c function. Note that this does the right thing with partial
indexes and things like that. It uses pg_statistic stats, where
available.

(My testing patch 0002-* has an SQL-callable function that lets
reviewers easily determine the projected size of some existing
index -- it is typically very accurate [2]. Apropos of nothing, that
might be worth polishing up and including as a general-purpose tool.)

2. Input that size into the compute_parallel_worker()-style
logarithmic scaling of number of workers.

3. Calculate how many workers will have at least a full
maintenance_work_mem share doled out, while still having at least
min_parallel_relation_size of workMem in tuplesort.c.

(I guess I should say min_parallel_table_scan_size or even
min_parallel_index_scan_size now, but whatever).

4. Return the minimum worker calculation from steps 2 and 3 (a rough
sketch of how these steps fit together follows this list).
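
Here is that sketch. Every helper name in it (estimate_index_pages(),
scale_workers_logarithmically(), and so on) is invented for
illustration, and the real patch differs in detail:

/*
 * Rough sketch of the V8 worker computation described above;
 * helper names are hypothetical, not from the patch.
 */
static int
plan_create_index_workers_sketch(Relation heapRel, IndexInfo *indexInfo)
{
	BlockNumber index_pages;
	long		share_kb;
	int			log_workers;
	int			mem_workers;

	/* Step 0: an explicit parallel_workers storage parameter wins */
	if (index_has_parallel_workers_reloption(indexInfo))	/* invented */
		return index_parallel_workers_reloption(indexInfo); /* invented */

	/* Step 1: project the size of the finished index */
	index_pages = estimate_index_pages(heapRel, indexInfo); /* invented */

	/* Step 2: scale workers logarithmically from the projected size */
	log_workers = scale_workers_logarithmically((double) index_pages,
									(double) min_parallel_relation_size);

	/*
	 * Step 3: how many workers can be doled out a full
	 * maintenance_work_mem share that is still at least
	 * min_parallel_relation_size of workMem?
	 */
	share_kb = Max(((long) min_parallel_relation_size * BLCKSZ) / 1024, 1);
	mem_workers = maintenance_work_mem / share_kb;

	/* Step 4: take the minimum of the two calculations */
	return Min(log_workers, mem_workers);
}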

So, a low maintenance_work_mem may cap our original suggested number
of workers. This cap isn't particularly likely to be applied, though.
Note also that the max_parallel_workers_maintenance GUC is given the
opportunity to cap things off. This is the utility statement
equivalent of max_parallel_workers_per_gather.
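
In terms of the hypothetical sketch above, that GUC cap is just one
more step at the very end:

/* the GUC gets the last word (names as in the sketch above) */
workers = Min(workers, max_parallel_workers_maintenance);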

Issues with this:

* This scales based on output size (projected index size), not input
size (heap scan input). Apparently, that's what we always do right
now.

* This is dissimilar to how we cost parallel seq scans. There, the
only costs associated with going with a parallel access path are fixed
startup overheads and IPC costs (paid in parallel_tuple_cost units);
see the snippet just after this list. So, we're not doing a comparison
between a serial plan and a parallel plan, even though we might want
to, especially because parallel CREATE INDEX always uses temp files,
unlike serial CREATE INDEX. cost_sort() is never involved in any of
this, and in any case isn't prepared to cost parallel sorts right now.

* OTOH, there is less sense in doing the equivalent of charging for
the IPC overhead that something like a Gather node is charged for
during planning, because here the IPC involved is, to some degree,
inherently necessary. If you were going to get an external sort
anyway, well,
that still involves temp files that are written to and read from.
Whether or not it's the same backend that does the writing as the
reading in all cases may matter very little. So, the main factor that
discourages parallel sequential scans doesn't really exist for
parallel CREATE INDEX.
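
For reference, the discouraging effect that parallel sequential scans
are subject to comes down to a couple of lines of costing. This is
approximately what cost_gather() charges (simplified from
costsize.c):

/* parallel setup and communication cost, per cost_gather() */
startup_cost += parallel_setup_cost;
run_cost += parallel_tuple_cost * path->path.rows;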

I am tempted to move to something closer to what you see elsewhere,
where a serial path and a partial path are both created. This would make
some sense for parallel CREATE INDEX, because you happen to have the
issue of parallelism effectively forcing an external sort. But
otherwise, it wouldn't make that much sense, because parallelism is
always going to help up to the point that all cores are in use, or at
least not hurt. Testing shows this to be the case. It's not as if
there are obvious IPC costs that push things against parallelism. This
is generally good, because the danger of using too many workers is
much less pronounced -- it's demonstrably very small, especially if
you assume a baseline of a serial external sort based CREATE INDEX.

What direction does the cost model need to go in?

I still lean towards the approach V8 takes, though the scaling should
possibly use heap pages and index pages with
compute_parallel_worker(), while not worrying about the input/output
distinction. It's not very appealing to have to teach cost_sort()
about any of this, since the things it considers currently are hard
to map onto parallelism. Besides, it's not as if there is a world of
difference between a serial internal sort CREATE INDEX and a parallel
external sort CREATE INDEX with lots of memory. It's nothing like the
potentially very large difference between a sort and an index scan
for CLUSTER, the case considered by plan_cluster_use_sort(). I think that
cost_sort() worries too much about some things, but not enough about
other things [3], so it's hard to imagine that it will ever
consistently do the right thing when we're expecting only small
differences in cost.
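
Concretely, the scaling I have in mind is little more than the
following, using compute_parallel_worker()'s new two-input signature
(projected_index_pages here stands for the step 1 estimate; variable
names are illustrative):

/*
 * Illustrative only: consider both the heap size and the projected
 * index size, without worrying about the input/output distinction.
 */
workers = compute_parallel_worker(rel, heap_pages,
								  projected_index_pages);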

We could always defer the cost model to another release, and only
support the storage parameter for now, though that has disadvantages,
some less obvious [4].

Peter Geoghegan
