On Mon, Jan 4, 2016 at 4:50 AM, Dilip Kumar <dilipbal...@gmail.com> wrote: > I tried to create a inner table such that, inner table data don't fit in RAM > (I created VM with 1GB Ram). > Purpose of this is to make Disk scan dominant, > and since parallel join is repeating the Disk Scan and hash table building > of inner table, so there will be lot of Parallel I/O and it has to pay > penalty. > > I think even though, inner table scanning and hash table building is > parallel, but there will be lot of parallel I/O which will become > bottleneck.
Hmm. Because only 1/1024th of the hash table fits in work_mem, the executor is going to have to write out all of the tuples that don't belong to the first batch to a temporary file and then read them back in. So each backend is going to write essentially the entirety of t2 out to disk and then read it all back in again. The non-parallel case will also write most of the table contents and then read them back in, but at least it will only be doing that once rather than 7 times, so it's not as bad. Also, with fewer backends running, the non-parallel case will have a bit more memory free for caching purposes. > Do we need to consider the cost for parallel i/o also, i am not sure can we > really do that... ? It seems to me that the problem here is that you've set max_parallel_degree to an unrealistically high value. The query planner is entitled to assume that the user has set max_parallel_degree to a value which is small enough that the workers won't be fighting too viciously with each other over resources. It doesn't really matter whether those resources are CPU resources or I/O resources. I'm wondering if your 1GB VM really even has as many as 7 vCPUs, because that would seem to be something of an unusual configuration - and if it doesn't, then setting max_parallel_degree to a value that high is certainly user error. Even if it does, it's still not right to set the value as high as six unless the system also has enough I/O bandwidth to accommodate the amount of I/O that you expect your queries to generate, and here it seems like it probably doesn't. To put that another way, you can always make parallel query perform badly by telling it to use too many workers relative to the size of the machine you have. This is no different than getting bad query plans by configuring work_mem or effective_cache_size or any other query planner GUC to a value that doesn't reflect the actual execution environment. I would only consider this to be a problem with the parallel join patch if the chosen plan is slower even on a machine that's big enough to justify setting max_parallel_degree=6 in the first place. While studying this example, I thought about whether we try to fix this case by generating a partial hash join path only if we expect a single batch, which would then cause the query planner to plan this query some other way. But after some thought I don't think that's the right approach. Multi-batch hash joins are in general quite a lot slower than single-batch hash joins - and initial_cost_hashjoin knows that - but if the system has adequate I/O bandwidth, that problem shouldn't be any worse for a parallel hash join than it is for a non-parallel hash join. I think the reason you're losing here is because the system either doesn't have as many vCPUs as the number of worker processes you are giving it, or it has a very limited amount of I/O bandwidth that can't handle multiple processes doing sequential I/O at the same time - e.g. a single spinning disk, or a single SSD plus a bunch of virtualization overhead. But that need not be the case. On a system where temporary files are written to a filesystem backend by an array of disks, you might well get some I/O parallelism. Of course if we experiment and find that doesn't work out well for some reason, then we've got a problem, but it doesn't seem implausible that it might be just fine. Another interesting question about this particular query is whether a merge join would have been faster, especially given all Peter Geoghegan's work to improve sort performance. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers