On Wed, Apr 8, 2015 at 3:34 AM, David Rowley <dgrowle...@gmail.com> wrote:
> On 8 April 2015 at 14:24, Robert Haas <robertmh...@gmail.com> wrote:
>> I think one of the philosophical questions that has to be answered
>> here is "what does it mean to talk about the cost of a parallel
>> plan?".  For a non-parallel plan, the cost of the plan means both
>> "the amount of effort we will spend executing the plan" and also
>> "the amount of time we think the plan will take to complete", but
>> those two things are different for parallel plans.  I'm inclined to
>> think it's right to view the cost of a parallel plan as a proxy for
>> execution time, because the fundamental principle of the planner is
>> that we pick the lowest-cost plan.  But there also clearly needs to
>> be some way to prevent the selection of a plan which runs slightly
>> faster at the cost of using vastly more resources.
>
> I'd agree with that as far as CPU costs, or maybe I'd just disagree
> with the alternative: if we costed in <cost of individual worker's
> work> * <number of workers>, then we'd never choose a parallel plan,
> because by the time we costed in tuple communication costs between
> the processes, a parallel plan would always cost more than the serial
> equivalent.  I/O costs are different; I'd imagine these shouldn't be
> divided by the estimated number of workers.
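Concretely, I take the costing David describes to amount to something
like the sketch below.  Every name here is hypothetical, not actual
planner code, and the per-tuple communication cost is an assumed
placeholder value:

/* A minimal sketch of the costing David describes; all names are
 * hypothetical.  CPU cost is divided by the worker count, since the
 * workers genuinely run concurrently; I/O cost is charged in full, on
 * the theory that the disk is a shared resource; and each tuple
 * funneled back to the leader costs extra.
 */
typedef struct ParallelCostSketch
{
    double  cpu_run_cost;       /* total CPU cost of the scan */
    double  disk_run_cost;      /* total I/O cost of the scan */
    double  tuples_returned;    /* tuples passed back to the leader */
} ParallelCostSketch;

#define SKETCH_TUPLE_COMM_COST  0.1   /* assumed per-tuple transfer cost */

static double
sketch_parallel_cost(const ParallelCostSketch *c, int nworkers)
{
    double  cost;

    /* CPU work is spread across the workers... */
    cost = c->cpu_run_cost / nworkers;

    /* ...but, per David's suggestion, I/O cost is not divided. */
    cost += c->disk_run_cost;

    /* Inter-process tuple communication isn't free. */
    cost += c->tuples_returned * SKETCH_TUPLE_COMM_COST;

    return cost;
}

On that model, the parallel plan wins exactly when the CPU savings
outweigh the communication overhead, which matches David's intuition
above.  The sticky part is that undivided I/O term.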
It's hard to say.  If the I/O is from the OS buffer cache, then there's
no reason why several workers can't run in parallel.  And even if it's
from the actual storage, we don't know what degree of I/O parallelism
will be possible.  Maybe effective_io_concurrency should play into the
costing formula somehow, but it's not very clear to me that it captures
the information we care about.

In general, I'm not sure how common it is for the execution speed of a
sequential scan to be limited by I/O.  For example, on a pgbench
database, scale factor 300, on a POWERPC machine provided by IBM for
performance testing (thanks, IBM!), a cached read of the
pgbench_accounts files took 1.122 seconds.  After dropping the caches,
it took 10.427 seconds.  "select * from pgbench_accounts where abalance
> 30000" took 10.244 seconds with a cold cache and 5.029 seconds with
a warm cache.  So on this particular hardware, on this particular
test, parallelism is useless if the cache is cold, but it could be
right to use ~4-5 processes for the scan if the cache is warm.

However, we have no way of knowing whether the cache will be cold or
warm at execution time.  This isn't a new problem.  As it is, the user
has to set seq_page_cost and random_page_cost based on either a
cold-cache assumption or a warm-cache assumption, and if they guess
wrong, their costing estimates will be off (on this platform, on this
test case) by 4-5x.  That's pretty bad, and it's totally unclear to me
what to do about it.  I'm guessing it's unclear to other people, too,
or we would likely have done something about it by now.

>> Some ideas for GUCs:
>>
>> max_parallel_degree = The largest number of processes we'll consider
>> using for a single query.
>> min_parallel_speedup = The minimum percentage by which a parallel
>> path must be cheaper (in terms of execution time) than a
>> non-parallel path in order to survive.  I'm imagining the default
>> here might be something like 15%.
>> min_parallel_speedup_per_worker = Like the previous one, but per
>> worker.  e.g. if this is 5%, which might be a sensible default, then
>> a plan with 4 workers must be at least 20% better to survive, but a
>> plan using only 2 workers only needs to be 10% better.
>
> max_parallel_degree feels awfully like it would have to be set
> conservatively, similar to how work_mem is today.  Like with
> work_mem, during quiet periods it sure would be nice if it could
> magically increase.

Absolutely.  But, similar to work_mem, that's a really hard problem.
We can't know at plan time how much work memory, or how many CPUs,
will be available at execution time.  And even if we did, it need not
be constant throughout the whole of query execution.  It could be that
when execution starts, there's lots of memory available, so we do a
quicksort rather than a tape-sort.  But midway through, the machine
comes under intense memory pressure, and there's no way for the system
to switch strategies.

Now, having said that, I absolutely believe that it's correct for the
planner to make the initial decisions in this area.  Parallelism
changes the cost of execution nodes, and it's completely wrong to
assume that this couldn't alter planner decisions at higher levels of
the plan tree.  At the same time, it's pretty clear that it would be a
great thing for the executor to be able to adjust the strategy if the
planner's assumptions don't pan out, or if conditions have changed.

For example, if we choose a seq-scan-sort-and-filter over an
index-scan-and-filter thinking that we'll be able to do a quicksort,
and then it turns out that we're short on memory, it's too late to
switch gears and adopt the index-scan-and-filter plan after all.  That
plan has long since been discarded.  But it's still better to switch
to a heap sort than to persist with a quicksort that's either going to
fail outright, or (maybe worse) succeed but drive the machine into
swap, which will just utterly obliterate performance.
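In sketch form, the kind of mid-flight switch I mean looks something
like this.  These names are made up for illustration, and the real
tuplesort logic is considerably more involved:

#include <stddef.h>

/* Illustrative only -- none of these names exist in the real tree. */
typedef enum
{
    SORT_IN_MEMORY,             /* accumulate tuples, quicksort at end */
    SORT_EXTERNAL               /* heap-based run building, spill to tapes */
} SortStrategySketch;

typedef struct SortStateSketch
{
    SortStrategySketch strategy;
    size_t      mem_used;
    size_t      mem_limit;      /* e.g. derived from work_mem */
} SortStateSketch;

/* Dump the tuples accumulated so far out to tapes (details omitted). */
static void
sketch_spill_to_tapes(SortStateSketch *state)
{
    /* ... write sorted runs to temp files ... */
}

static void
sketch_put_tuple(SortStateSketch *state, size_t tuple_size)
{
    state->mem_used += tuple_size;

    /*
     * The planner assumed the sort would fit in memory.  If that
     * assumption fails at run time, switch strategies rather than
     * fail outright or push the machine into swap.
     */
    if (state->strategy == SORT_IN_MEMORY &&
        state->mem_used > state->mem_limit)
    {
        sketch_spill_to_tapes(state);
        state->strategy = SORT_EXTERNAL;
    }

    /* ... store the tuple under the current strategy ... */
}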
>> An additional benefit of this line of thinking is that planning
>> would always produce a best non-parallel path.  And sometimes, there
>> would also be a best parallel path that is expected to run faster.
>> We could then choose between them dynamically at execution time.
>
> Actually store 2 plans within the plan?  Like with an
> AlternativePlanNode?

Yeah.  I'm not positive that's a good idea, but it seems like it might
be.
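Very roughly, and with entirely made-up names, the execution-time
choice could look like the following.  The worker-counting helper is
an assumption for the sketch, not an existing function:

/* Hypothetical sketch of choosing between a serial and a parallel
 * plan at executor startup; none of these are real planner/executor
 * structures.
 */
typedef struct PlanSketch PlanSketch;   /* stand-in for a real Plan */

typedef struct AlternativePlanSketch
{
    PlanSketch *serial_plan;    /* best plan assuming no workers */
    PlanSketch *parallel_plan;  /* best plan assuming workers, or NULL */
    int         workers_needed; /* degree the parallel plan was costed at */
} AlternativePlanSketch;

/* Assumed helper: how many background workers could we get right now? */
extern int  sketch_count_available_workers(void);

static PlanSketch *
sketch_choose_plan(AlternativePlanSketch *alt)
{
    /*
     * If the workers the parallel plan was costed for aren't actually
     * available, fall back to the best serial plan rather than
     * running a parallel-shaped plan with too little help.
     */
    if (alt->parallel_plan != NULL &&
        sketch_count_available_workers() >= alt->workers_needed)
        return alt->parallel_plan;

    return alt->serial_plan;
}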
>> I think it's pretty hard to imagine a scenario as extreme as the
>> one you mention above ever actually occurring in practice.  I mean,
>> even the most naive implementation of parallel query will presumably
>> have something like max_parallel_degree, and you probably won't have
>> that set to 128.  For starters, it can't possibly make sense unless
>> your server has at least 128 CPUs, and even then it only makes sense
>> if you don't mind a single query using all of them, and even if the
>> first of those things is true, the second one probably isn't.  I
>> don't doubt that less extreme variants of this scenario are
>> possible, though.
>
> Yeah maybe, it does seem quite extreme, but maybe less so as the
> years roll on a bit... perhaps in 5-10 years it might be quite common
> to have that many spare CPU cores to throw at a task.

That is certainly possible, but we need to start small.  It's
completely OK for the first version of this feature to have some rough
edges that get improved later.  Indeed, it's absolutely vital, or
we'll never get this thing off the ground.

> I think if we have this percentage GUC you mentioned to prefer
> parallel plans if they're within a % threshold of the serial plan,
> then we could end up with problems with I/O, and with buffers getting
> thrown out of caches, due to the extra I/O involved when parallel
> plans go with seq scans instead of serial plans choosing index scans.

That's possible, but the non-parallel planner doesn't account for
caching effects, either.

> In summary it sounds like with my idea we get:
>
> Pros
> * Optimal plan if no workers are available at execution time.
> * Parallelism possible if the chosen optimal plan happens to support
>   parallelism, e.g. not index scan.
> * No planning overhead

The third one isn't really true.  You've just moved some of the
planning to execution time.

> Cons:
> * The plan "Parallelizer" must make changes to the plan just before
>   execution time, which ruins the 1 to 1 ratio of plan/executor
>   nodes by the time you inject Funnel nodes.
>
> If we parallelise during planning time:
>
> Pros
> * More chance of getting a parallel-friendly plan which could end up
>   being very fast if we get enough workers at executor time.

This, to me, is by far the biggest "con" of trying to do something at
execution time.  If planning doesn't take into account the gains that
are possible from parallelism, then you'll only be able to come up
with the best parallel plan when it happens to be a parallelized
version of the best serial plan.  So long as the only parallel
operator is parallel seq scan, that will probably be a common
scenario.  But once we assemble a decent selection of parallel
operators, and a reasonably intelligent parallel query optimizer, I'm
not so sure it'll still be true.

> Cons:
> * May produce non-optimal plans if no worker processes are available
>   during execution time.
> * Planning overhead for considering parallel paths.
> * The parallel plan may blow out buffer caches due to the increased
>   I/O of the parallel plan.
>
> Of course please say if I've missed any pro or con.

I think I generally agree with your list, but we might not agree on
the relative importance of the items on it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company