Hi, 

On April 9, 2020 12:29:09 PM PDT, Robert Haas <robertmh...@gmail.com> wrote:
>On Thu, Apr 9, 2020 at 2:55 PM Andres Freund <and...@anarazel.de>
>wrote:
>> I'm fairly certain that we do *not* want to distribute input data
>> between processes on a single tuple basis. Probably not even below a
>> few hundred kb. If there's any sort of natural clustering in the loaded
>> data - extremely common, think timestamps - splitting on a granular
>> basis will make indexing much more expensive. And have a lot more
>> contention.
>
>That's a fair point. I think the solution ought to be that once any
>process starts finding line endings, it continues until it's grabbed
>at least a certain amount of data for itself. Then it stops and lets
>some other process grab a chunk of data.
>
>Or are you are arguing that there should be only one process that's
>allowed to find line endings for the entire duration of the load?

I've not yet read the whole thread, so I'm probably restating ideas.

Imo, yes, there should be only one process doing the chunking. Partly for ILP 
and cache efficiency, but also because the leader is the only process with 
access to the network socket. It should load the input data into one large 
buffer that's shared across processes. There should be a separate ringbuffer 
of tuple / partial-tuple (for huge tuples) offsets into that buffer. Worker 
processes should grab large chunks of offsets from that offset ringbuffer. If 
the ringbuffer is not full, the chunks workers grab should be reduced in size, 
so the remaining work is spread across the workers instead of being handed 
wholesale to whichever worker asks first.
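
To make that a bit more concrete, here's a rough, single-file sketch of the 
layout I have in mind. All of the names, sizes and the chunk-shrinking 
heuristic below are invented for illustration, and there's no locking/atomics, 
ring wraparound or flow control - it's just meant to show the shape, not to 
be actual code:

/*
 * Toy sketch only - names, sizes and heuristics are invented, and there is
 * no locking, no ring wraparound handling and no flow control.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define INPUT_BUFFER_SIZE   (1024 * 1024)   /* shared raw COPY input (tiny here) */
#define OFFSET_RING_SIZE    8192            /* slots for tuple start offsets */
#define MAX_CHUNK_TUPLES    512             /* workers grab at most this many */
#define MIN_CHUNK_TUPLES    16              /* ...and at least this many */

typedef struct ParallelCopyShared
{
    /* raw input bytes, appended by the leader as they arrive on the socket */
    char        input[INPUT_BUFFER_SIZE];
    uint64_t    input_used;     /* bytes written so far */
    uint64_t    pending_start;  /* start offset of the not-yet-complete line */

    /* ring of start offsets into input[], one entry per complete tuple */
    uint64_t    tuple_offset[OFFSET_RING_SIZE];
    uint64_t    ring_write;     /* next slot the leader fills */
    uint64_t    ring_read;      /* next slot a worker consumes */
} ParallelCopyShared;

/*
 * Leader: append raw input and publish the start offset of every line that
 * just became complete.  Only the leader ever touches the socket.
 */
static void
leader_publish(ParallelCopyShared *shared, const char *data, size_t len)
{
    uint64_t    base = shared->input_used;

    memcpy(shared->input + base, data, len);
    shared->input_used += len;

    for (size_t i = 0; i < len; i++)
    {
        if (data[i] == '\n')
        {
            shared->tuple_offset[shared->ring_write % OFFSET_RING_SIZE] =
                shared->pending_start;
            shared->ring_write++;
            shared->pending_start = base + i + 1;
        }
    }
}

/*
 * Worker: grab a chunk of tuple offsets, taking smaller bites when the ring
 * is running low so the remaining work is spread over all workers.
 */
static uint64_t
worker_grab_chunk(ParallelCopyShared *shared,
                  uint64_t *first_slot, uint64_t *ntuples)
{
    uint64_t    available = shared->ring_write - shared->ring_read;
    uint64_t    chunk = MAX_CHUNK_TUPLES;

    if (available == 0)
        return 0;               /* nothing yet, caller would sleep/wait */

    if (available < OFFSET_RING_SIZE / 2)
        chunk = available / 4;  /* ring not full: shrink the chunk */
    if (chunk < MIN_CHUNK_TUPLES)
        chunk = MIN_CHUNK_TUPLES;
    if (chunk > available)
        chunk = available;

    *first_slot = shared->ring_read;
    *ntuples = chunk;
    shared->ring_read += chunk; /* would be an atomic fetch-add for real */
    return chunk;
}

int
main(void)
{
    static ParallelCopyShared shared;
    uint64_t    first, n;

    leader_publish(&shared, "1\ta\n2\tb\n3\tc\n", 12);
    if (worker_grab_chunk(&shared, &first, &n) > 0)
        printf("worker got %" PRIu64 " tuple(s), first starts at byte %" PRIu64 "\n",
               n, shared.tuple_offset[first % OFFSET_RING_SIZE]);
    return 0;
}

The only interesting bit is the worker side taking smaller bites when the 
ring is running low, so the tail end of the input still gets spread across 
all workers.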

Given that everything stalls if the leader doesn't accept further input data, 
and also stalls when there are no split chunks available for workers, it 
doesn't seem like a good idea to have the leader do any other work.


I don't think optimizing for / targeting COPY from local files, where 
multiple processes could read the file directly, is useful. COPY FROM STDIN 
is the only case that practically matters.

Andres


-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

