Re: [HACKERS] Parallel COPY FROM execution

Pavel Stehule Fri, 30 Jun 2017 05:37:06 -0700

2017-06-30 14:23 GMT+02:00 Alex K <[email protected]>:

> Greetings pgsql-hackers,
>
> I am a GSoC student this year; my initial proposal was discussed
> in the following thread:
> https://www.postgresql.org/message-id/flat/7179F2FD-49CE-4093-AE14-1B26C5DFB0DA%40gmail.com
>
> The patch with COPY FROM error handling seems to be nearly finished, so
> I have started thinking about parallelism in COPY FROM, which is the next
> point in my proposal.
>
> To find out whether COPY contains expensive calls that could be
> executed in parallel, I did some profiling. First, please find attached
> a flame graph (copy_from.svg) of the most expensive copy.c calls during
> 'COPY FROM file'. It shows that the inevitably serial operations, such as
> CopyReadLine (<15%) and heap_multi_insert (~15%), together take less than
> 50% of the time, while the remaining operations, such as heap_form_tuple
> and the multiple checks inside NextCopyFrom, can probably be executed
> well in parallel.
>
> Second, I compared the execution time of 'COPY FROM a single large
> file (~300 MB, 50000000 lines)' with that of 'COPY FROM four equal parts
> of the original file, executed in four parallel processes'. Though it is
> a very rough test, it gives an overall estimate:
>
> Serial:
> real 0m56.571s
> user 0m0.005s
> sys 0m0.006s
>
> Parallel (x4):
> real 0m22.542s
> user 0m0.015s
> sys 0m0.018s
>
> Thus, each doubling of the number of parallel processes gives a ~60%
> performance boost, which is consistent with the initial estimate.
>
>
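The arithmetic behind the quoted "~60% per doubling" figure can be checked with a short sketch. It uses only the two wall-clock ("real") times quoted above; the function names are made up for illustration, and it assumes the gain compounds evenly across doublings, which the message does not state.

```python
import math

def speedup(serial_seconds, parallel_seconds):
    """Overall speedup: serial wall-clock time over parallel wall-clock time."""
    return serial_seconds / parallel_seconds

def gain_per_doubling(total_speedup, processes):
    """Average relative gain per x2 step, assuming the gain compounds evenly."""
    doublings = math.log2(processes)
    return total_speedup ** (1.0 / doublings) - 1.0

# "real" times quoted in the message: 0m56.571s serial, 0m22.542s with 4 processes
total = speedup(56.571, 22.542)
per_doubling = gain_per_doubling(total, 4)

print(f"speedup with 4 processes: {total:.2f}x")
print(f"average gain per doubling: {per_doubling:.0%}")
```

This gives a speedup of roughly 2.5x for four processes and a bit under 60% per doubling, in line with the quoted estimate.
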
The important use case is a big table with a lot of indexes. Did you test
a similar case?
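
One way such a test could be set up is sketched below: a small helper that generates the SQL for a table with one index per column, followed by the COPY. The table name, column names, integer column type, and file path are all hypothetical; nothing here comes from the patch or from the original test.

```python
def index_benchmark_sql(table, columns, data_file):
    """Build SQL for a COPY FROM test on a table with one index per column.

    All identifiers are placeholders; every column is assumed to be an integer.
    """
    column_defs = ", ".join(f"{name} integer" for name in columns)
    statements = [f"CREATE TABLE {table} ({column_defs});"]
    # Indexes are created before loading, so COPY has to maintain them per row.
    for name in columns:
        statements.append(f"CREATE INDEX {table}_{name}_idx ON {table} ({name});")
    statements.append(f"COPY {table} FROM '{data_file}';")
    return statements

# Hypothetical names, for illustration only.
for statement in index_benchmark_sql("copy_bench", ["a", "b", "c"], "/tmp/data.txt"):
    print(statement)
```

Running the generated statements once against the whole file and once against the split parts in parallel sessions would give timings comparable to the ones quoted above.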


Regards

Pavel
