On Mon, 13 Apr 2020 at 23:16, Andres Freund <and...@anarazel.de> wrote:
> > Still, if the reader does the splitting, then you don't need as much
> > IPC, right? The shared memory data structure is just a ring of bytes,
> > and whoever reads from it is responsible for the rest.
>
> I don't think so. If only one process does the splitting, the
> exclusively locked section is just popping off a bunch of offsets of the
> ring. And that could fairly easily be done with atomic ops (since what
> we need is basically a single producer multiple consumer queue, which
> can be done lock free fairly easily). Whereas in the case of each
> process doing the splitting, the exclusively locked part is splitting
> along lines - which takes considerably longer than just popping off a
> few offsets.
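For illustration, a minimal sketch of the kind of single producer /
multiple consumer offset queue described above might look like the
following. Names, sizes and the use of plain C11 atomics (rather than
pg_atomic_*) are mine, purely for illustration; a real implementation
would also need to deal with waiting when the queue is full or empty.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_QUEUE_SIZE 1024      /* must be a power of two */

    typedef struct OffsetQueue
    {
        _Atomic uint64_t write_pos;     /* only the splitter advances this */
        _Atomic uint64_t read_pos;      /* workers race to advance this */
        _Atomic uint64_t offsets[OFFSET_QUEUE_SIZE];    /* chunk start offsets */
    } OffsetQueue;

    static void
    offset_queue_init(OffsetQueue *q)
    {
        atomic_init(&q->write_pos, 0);
        atomic_init(&q->read_pos, 0);
    }

    /* Splitter: publish the byte offset of the next chunk it has found. */
    static bool
    offset_queue_push(OffsetQueue *q, uint64_t off)
    {
        uint64_t wpos = atomic_load_explicit(&q->write_pos, memory_order_relaxed);
        uint64_t rpos = atomic_load_explicit(&q->read_pos, memory_order_acquire);

        if (wpos - rpos >= OFFSET_QUEUE_SIZE)
            return false;               /* full; caller waits and retries */

        atomic_store_explicit(&q->offsets[wpos % OFFSET_QUEUE_SIZE], off,
                              memory_order_relaxed);
        /* release: the offset store must be visible before the new write_pos */
        atomic_store_explicit(&q->write_pos, wpos + 1, memory_order_release);
        return true;
    }

    /* Worker: try to claim the next chunk offset; false if nothing is queued. */
    static bool
    offset_queue_pop(OffsetQueue *q, uint64_t *off)
    {
        for (;;)
        {
            uint64_t rpos = atomic_load_explicit(&q->read_pos, memory_order_relaxed);
            uint64_t wpos = atomic_load_explicit(&q->write_pos, memory_order_acquire);

            if (rpos == wpos)
                return false;           /* queue currently empty */

            uint64_t candidate =
                atomic_load_explicit(&q->offsets[rpos % OFFSET_QUEUE_SIZE],
                                     memory_order_relaxed);

            /*
             * Claim the slot by advancing read_pos; if another worker got
             * there first, the CAS fails and we retry at the new position.
             */
            if (atomic_compare_exchange_weak_explicit(&q->read_pos, &rpos,
                                                      rpos + 1,
                                                      memory_order_acq_rel,
                                                      memory_order_relaxed))
            {
                *off = candidate;
                return true;
            }
        }
    }

    int
    main(void)
    {
        static OffsetQueue q;   /* would live in shared memory in reality */
        uint64_t off;

        offset_queue_init(&q);
        offset_queue_push(&q, 0);
        offset_queue_push(&q, 8192);
        while (offset_queue_pop(&q, &off))
            printf("claimed chunk at offset %llu\n", (unsigned long long) off);
        return 0;
    }

The point being that the only operation workers contend on is the CAS
advancing read_pos; the splitter never takes a lock at all.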
I see the benefit of having one process responsible for splitting as being
able to run ahead of the workers and queue up work for when many of them
need new data at the same time. I don't think the locking benefits of a
ring are important in this case. At the current, rather conservative chunk
sizes we are looking at ~100k chunks per second at best; normal locking
should be perfectly adequate for that, and the chunk size can easily be
increased. I see the main value of the ring approach in its simplicity.

But it is a fair point that having a layer of indirection instead of a
linear buffer allows some workers to fall behind - whether because the
kernel scheduled them out for a time slice, because they need to do I/O,
or because inserting some tuple hit a unique-constraint conflict and has
to wait for another transaction to complete or abort before the conflict
is resolved. With a ring buffer, reading has to wait on the slowest worker
still consuming its chunk. Having workers copy the data to a local buffer
as the first step would reduce the probability of hitting such stalls, but
still, at GB/s rates, hiding a 10 ms timeslice of delay would need tens of
megabytes of buffer. FWIW, I think just increasing the buffer is good
enough - the CPUs processing this kind of workload are likely to have tens
to hundreds of megabytes of cache on board.
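For concreteness, that buffer estimate is just ingest rate times stall
duration; a trivial sketch, with the ~1 GB/s rate being an assumed round
number rather than a measured one:

    #include <stdio.h>

    int
    main(void)
    {
        const double bytes_per_sec = 1e9;   /* assume ~1 GB/s of COPY input */
        const double stall_sec = 0.010;     /* one 10 ms scheduler timeslice */

        /* Data that keeps arriving while a single worker is stalled. */
        printf("buffer needed to hide one stall: %.0f MB\n",
               bytes_per_sec * stall_sec / 1e6);    /* prints 10 MB */
        return 0;
    }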