> -----Original Message-----
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Thursday, February 04, 2016 2:54 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: email@example.com
> Subject: ##freemail## Re: [HACKERS] CustomScan under the Gather node?
>
> On Thu, Jan 28, 2016 at 8:14 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> >>             total          ForeignScan      diff
> >> 0 workers: 17584.319 ms   17555.904 ms     28.415 ms
> >> 1 workers: 18464.476 ms   18110.968 ms    353.508 ms
> >> 2 workers: 19042.755 ms   14580.335 ms   4462.420 ms
> >> 3 workers: 19318.254 ms   12668.912 ms   6649.342 ms
> >> 4 workers: 21732.910 ms   13596.788 ms   8136.122 ms
> >> 5 workers: 23486.846 ms   14533.409 ms   8953.437 ms
> >>
> >> This workstation has 4 CPU cores, so it is natural that nworkers=3
> >> records the peak performance on the ForeignScan portion. On the other
> >> hand, nworkers>1 also recorded unignorable time consumption
> >> (probably in the Gather node?)
> > :
> >> Further investigation is needed....
> >>
> > It was a bug in my file_fdw patch. The ForeignScan node in the master
> > process was also kicked by the Gather node, but it lacked the
> > coordination information due to an oversight in the initialization at
> > the InitializeDSMForeignScan callback. As a result, the local
> > ForeignScan node was still executed after the coordinated background
> > worker processes completed, and returned twice the number of rows.
> >
> > With the revised patch, the results seem reasonable.
> >             total          ForeignScan      diff
> > 0 workers: 17592.498 ms   17564.457 ms     28.041 ms
> > 1 workers: 12152.998 ms   11983.485 ms    169.513 ms
> > 2 workers: 10647.858 ms   10502.100 ms    145.758 ms
> > 3 workers:  9635.445 ms    9509.899 ms    125.546 ms
> > 4 workers: 11175.456 ms   10863.293 ms    312.163 ms
> > 5 workers: 12586.457 ms   12279.323 ms    307.134 ms
>
> Hmm.  Is the file_fdw part of this just a demo, or do you want to try
> to get that committed?  If so, maybe start a new thread with a more
> appropriate subject line to just talk about that.
> I haven't scrutinized that part of the patch in any detail, but the
> general infrastructure for FDWs and custom scans to use parallelism
> seems to be in good shape, so I rewrote the documentation and
> committed that part.
>
Thanks. The file_fdw part is just for demonstration. Unlike the GpuScan
of PG-Strom, it does not require any special hardware to reproduce this
parallel execution.
> Do you have any idea why this isn't scaling beyond, uh, 1 worker?
> That seems like a good thing to try to figure out.
>
The hardware I ran the above query on has 4 CPU cores, so it is not
surprising that 3 workers (+ 1 master) recorded the peak performance.

In addition, the enhancement to the file_fdw part is corner-cutting
work. Each worker picks up the next line number to be fetched from the
shared memory segment using pg_atomic_add_fetch_u32(), then reads the
input file until it reaches the target line; unrelated lines are
ignored. Each worker parses only the lines it is responsible for, so
parallel execution makes sense for the parsing part. On the other hand,
the total amount of CPU cycles spent on the file scan increases,
because every worker still has to scan over all the lines.

If we simply split the time consumption in the 0-worker case into two
factors:

  (time to scan file; TSF) + (time to parse lines; TPL)

then the total amount of work when we distribute file_fdw across N
workers is:

  N * (TSF) + (TPL)

and each individual worker has to process the following amount of work:

  (TSF) + (TPL)/N

This is a typical instance of Amdahl's law when the sequential part is
not small. The above results suggest the TSF part is about 7.4s and
the TPL part is about 10.1s.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>


--
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers