On Wed, Apr 8, 2015 at 3:34 AM, David Rowley <dgrowle...@gmail.com> wrote:
> On 8 April 2015 at 14:24, Robert Haas <robertmh...@gmail.com> wrote:
>> I think one of the philosophical questions that has to be answered
>> here is "what does it mean to talk about the cost of a parallel
>> plan?".  For a non-parallel plan, the cost of the plan means both
>> "the amount of effort we will spend executing the plan" and also
>> "the amount of time we think the plan will take to complete", but
>> those two things are different for parallel plans.  I'm inclined to
>> think it's right to view the cost of a parallel plan as a proxy for
>> execution time, because the fundamental principle of the planner is
>> that we pick the lowest-cost plan.  But there also clearly needs to
>> be some way to prevent the selection of a plan which runs slightly
>> faster at the cost of using vastly more resources.
>
> I'd agree with that as far as CPU costs, or maybe I'd just disagree
> with the alternative: if we costed in <cost of individual worker's
> work> * <number of workers>, then we'd never choose a parallel plan,
> because by the time we costed in tuple communication costs between
> the processes, a parallel plan would always cost more than the serial
> equivalent.  I/O costs are different; I'd imagine these shouldn't be
> divided by the estimated number of workers.
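Concretely, I take the costing David describes to amount to something
like the sketch below.  Every name here is hypothetical, not actual
planner code, and the per-tuple communication cost is an assumed
placeholder value:

/* A minimal sketch of the costing David describes; all names are
 * hypothetical.  CPU cost is divided by the worker count, since the
 * workers genuinely run concurrently; I/O cost is charged in full, on
 * the theory that the disk is a shared resource; and each tuple
 * funneled back to the leader costs extra.
 */
typedef struct ParallelCostSketch
{
    double  cpu_run_cost;       /* total CPU cost of the scan */
    double  disk_run_cost;      /* total I/O cost of the scan */
    double  tuples_returned;    /* tuples passed back to the leader */
} ParallelCostSketch;

#define SKETCH_TUPLE_COMM_COST  0.1   /* assumed per-tuple transfer cost */

static double
sketch_parallel_cost(const ParallelCostSketch *c, int nworkers)
{
    double  cost;

    /* CPU work is spread across the workers... */
    cost = c->cpu_run_cost / nworkers;

    /* ...but, per David's suggestion, I/O cost is not divided. */
    cost += c->disk_run_cost;

    /* Inter-process tuple communication isn't free. */
    cost += c->tuples_returned * SKETCH_TUPLE_COMM_COST;

    return cost;
}

On that model, the parallel plan wins exactly when the CPU savings
outweigh the communication overhead, which matches David's intuition
above.  The sticky part is that undivided I/O term.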
It's hard to say.  If the I/O is from the OS buffer cache, then there's
no reason why several workers can't run in parallel.  And even if it's
from the actual storage, we don't know what degree of I/O parallelism
will be possible.  Maybe effective_io_concurrency should play into the
costing formula somehow, but it's not very clear to me that it captures
the information we care about.

In general, I'm not sure how common it is for the execution speed of a
sequential scan to be limited by I/O.  For example, on a pgbench
database, scale factor 300, on a POWERPC machine provided by IBM for
performance testing (thanks, IBM!), a cached read of the
pgbench_accounts files took 1.122 seconds.  After dropping the caches,
it took 10.427 seconds.  "select * from pgbench_accounts where abalance
> 30000" took 10.244 seconds with a cold cache and 5.029 seconds with
a warm cache.  So on this particular hardware, on this particular
test, parallelism is useless if the cache is cold, but it could be
right to use ~4-5 processes for the scan if the cache is warm.

However, we have no way of knowing whether the cache will be cold or
warm at execution time.  This isn't a new problem.  As it is, the user
has to set seq_page_cost and random_page_cost based on either a
cold-cache assumption or a warm-cache assumption, and if they guess
wrong, their costing estimates will be off (on this platform, on this
test case) by 4-5x.  That's pretty bad, and it's totally unclear to me
what to do about it.  I'm guessing it's unclear to other people, too,
or we would likely have done something about it by now.

>> Some ideas for GUCs:
>>
>> max_parallel_degree = The largest number of processes we'll consider
>> using for a single query.
>> min_parallel_speedup = The minimum percentage by which a parallel
>> path must be cheaper (in terms of execution time) than a
>> non-parallel path in order to survive.  I'm imagining the default
>> here might be something like 15%.
>> min_parallel_speedup_per_worker = Like the previous one, but per
>> worker.  e.g. if this is 5%, which might be a sensible default, then
>> a plan with 4 workers must be at least 20% better to survive, but a
>> plan using only 2 workers only needs to be 10% better.
>
> max_parallel_degree feels awfully like it would have to be set
> conservatively, similar to how work_mem is today.  Like with
> work_mem, during quiet periods it sure would be nice if it could
> magically increase.

Absolutely.  But, similar to work_mem, that's a really hard problem.
We can't know at plan time how much work memory, or how many CPUs,
will be available at execution time.  And even if we did, it need not
be constant throughout the whole of query execution.  It could be that
when execution starts, there's lots of memory available, so we do a
quicksort rather than a tape-sort.  But midway through, the machine
comes under intense memory pressure, and there's no way for the system
to switch strategies.

Now, having said that, I absolutely believe that it's correct for the
planner to make the initial decisions in this area.  Parallelism
changes the cost of execution nodes, and it's completely wrong to
assume that this couldn't alter planner decisions at higher levels of
the plan tree.  At the same time, it's pretty clear that it would be a
great thing for the executor to be able to adjust the strategy if the
planner's assumptions don't pan out, or if conditions have changed.

For example, if we choose a seq-scan-sort-and-filter over an
index-scan-and-filter thinking that we'll be able to do a quicksort,
and then it turns out that we're short on memory, it's too late to
switch gears and adopt the index-scan-and-filter plan after all.  That
plan has long since been discarded.  But it's still better to switch
to a heap sort than to persist with a quicksort that's either going to
fail outright, or (maybe worse) succeed but drive the machine into
swap, which will just utterly obliterate performance.
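In sketch form, the kind of mid-flight switch I mean looks something
like this.  These names are made up for illustration, and the real
tuplesort logic is considerably more involved:

#include <stddef.h>

/* Illustrative only -- none of these names exist in the real tree. */
typedef enum
{
    SORT_IN_MEMORY,             /* accumulate tuples, quicksort at end */
    SORT_EXTERNAL               /* heap-based run building, spill to tapes */
} SortStrategySketch;

typedef struct SortStateSketch
{
    SortStrategySketch strategy;
    size_t      mem_used;
    size_t      mem_limit;      /* e.g. derived from work_mem */
} SortStateSketch;

/* Dump the tuples accumulated so far out to tapes (details omitted). */
static void
sketch_spill_to_tapes(SortStateSketch *state)
{
    /* ... write sorted runs to temp files ... */
}

static void
sketch_put_tuple(SortStateSketch *state, size_t tuple_size)
{
    state->mem_used += tuple_size;

    /*
     * The planner assumed the sort would fit in memory.  If that
     * assumption fails at run time, switch strategies rather than
     * fail outright or push the machine into swap.
     */
    if (state->strategy == SORT_IN_MEMORY &&
        state->mem_used > state->mem_limit)
    {
        sketch_spill_to_tapes(state);
        state->strategy = SORT_EXTERNAL;
    }

    /* ... store the tuple under the current strategy ... */
}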
>> An additional benefit of this line of thinking is that planning
>> would always produce a best non-parallel path.  And sometimes, there
>> would also be a best parallel path that is expected to run faster.
>> We could then choose between them dynamically at execution time.
>
> Actually store 2 plans within the plan?  Like with an
> AlternativePlanNode?

Yeah.  I'm not positive that's a good idea, but it seems like it might
be.
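Very roughly, and with entirely made-up names, the execution-time
choice could look like the following.  The worker-counting helper is
an assumption for the sketch, not an existing function:

/* Hypothetical sketch of choosing between a serial and a parallel
 * plan at executor startup; none of these are real planner/executor
 * structures.
 */
typedef struct PlanSketch PlanSketch;   /* stand-in for a real Plan */

typedef struct AlternativePlanSketch
{
    PlanSketch *serial_plan;    /* best plan assuming no workers */
    PlanSketch *parallel_plan;  /* best plan assuming workers, or NULL */
    int         workers_needed; /* degree the parallel plan was costed at */
} AlternativePlanSketch;

/* Assumed helper: how many background workers could we get right now? */
extern int  sketch_count_available_workers(void);

static PlanSketch *
sketch_choose_plan(AlternativePlanSketch *alt)
{
    /*
     * If the workers the parallel plan was costed for aren't actually
     * available, fall back to the best serial plan rather than
     * running a parallel-shaped plan with too little help.
     */
    if (alt->parallel_plan != NULL &&
        sketch_count_available_workers() >= alt->workers_needed)
        return alt->parallel_plan;

    return alt->serial_plan;
}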
>> I think it's pretty hard to imagine a scenario as extreme as the
>> one you mention above ever actually occurring in practice.  I mean,
>> even the most naive implementation of parallel query will presumably
>> have something like max_parallel_degree, and you probably won't have
>> that set to 128.  For starters, it can't possibly make sense unless
>> your server has at least 128 CPUs, and even then it only makes sense
>> if you don't mind a single query using all of them, and even if the
>> first of those things is true, the second one probably isn't.  I
>> don't doubt that less extreme variants of this scenario are
>> possible, though.
>
> Yeah maybe, it does seem quite extreme, but maybe less so as the
> years roll on a bit... perhaps in 5-10 years it might be quite common
> to have that many spare CPU cores to throw at a task.

That is certainly possible, but we need to start small.  It's
completely OK for the first version of this feature to have some rough
edges that get improved later.  Indeed, it's absolutely vital, or
we'll never get this thing off the ground.

> I think if we have this percentage GUC you mentioned to prefer
> parallel plans if they're within a % threshold of the serial plan,
> then we could end up with problems with I/O, and with buffers getting
> thrown out of caches, due to the extra I/O involved when parallel
> plans go with seq scans instead of serial plans choosing index scans.

That's possible, but the non-parallel planner doesn't account for
caching effects, either.

> In summary it sounds like with my idea we get:
>
> Pros
> * Optimal plan if no workers are available at execution time.
> * Parallelism possible if the chosen optimal plan happens to support
>   parallelism, e.g. not index scan.
> * No planning overhead

The third one isn't really true.  You've just moved some of the
planning to execution time.

> Cons:
> * The plan "Parallelizer" must make changes to the plan just before
>   execution time, which ruins the 1 to 1 ratio of plan/executor
>   nodes by the time you inject Funnel nodes.
>
> If we parallelise during planning time:
>
> Pros
> * More chance of getting a parallel-friendly plan which could end up
>   being very fast if we get enough workers at executor time.

This, to me, is by far the biggest "con" of trying to do something at
execution time.  If planning doesn't take into account the gains that
are possible from parallelism, then you'll only be able to come up
with the best parallel plan when it happens to be a parallelized
version of the best serial plan.  So long as the only parallel
operator is parallel seq scan, that will probably be a common
scenario.  But once we assemble a decent selection of parallel
operators, and a reasonably intelligent parallel query optimizer, I'm
not so sure it'll still be true.

> Cons:
> * May produce non-optimal plans if no worker processes are available
>   during execution time.
> * Planning overhead for considering parallel paths.
> * The parallel plan may blow out buffer caches due to the increased
>   I/O of the parallel plan.
>
> Of course please say if I've missed any pro or con.

I think I generally agree with your list, but we might not agree on
the relative importance of the items on it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company