On Sat, 15 Feb 2020 at 14:32, Amit Kapila <amit.kapil...@gmail.com> wrote:
> Good point and I agree with you that having a single process would
> avoid any such stuff. However, I will think some more on it and if
> you/anyone else gets some idea on how to deal with this in a
> multi-worker system (where we can allow each worker to read and
> process the chunk) then feel free to share your thoughts.
I think having a single process handle splitting the input into tuples makes the most sense. It is possible to parse CSV at multiple GB/s [1], and finding tuple boundaries is a subset of that task.

My first thought for a design would be to have two shared memory ring buffers, one for data and one for tuple start positions. The reader process reads the CSV data into the main buffer, finds the tuple start locations in it, and writes those to the secondary buffer. Worker processes claim a chunk of tuple positions from the secondary buffer and update their "keep this data around" position to the first position in the chunk. They then proceed to parse and insert the tuples, advancing that position until they reach the end of the last tuple in the chunk.

Buffer size and the maximum and minimum chunk sizes could be tunable. Ideally the buffers would be at least big enough to absorb one of the workers being scheduled out for a timeslice, which could mean up to tens of megabytes.

Regards,
Ants Aasma

[1] https://github.com/geofflangdale/simdcsv/
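For illustration, here is a minimal single-process Python sketch of the reader/worker split described above. The helper names (find_tuple_starts, parse_chunk) are made up for this sketch; the shared-memory ring buffers, atomic chunk claiming, and the "keep this data around" bookkeeping are elided. It does show the one subtlety the reader process must handle: a newline inside a quoted CSV field is not a tuple boundary, so boundary scanning has to track quote state.

```python
import csv
import io

def find_tuple_starts(data: bytes) -> list[int]:
    """Reader side: scan the buffered CSV data once and record the byte
    offset at which each tuple starts. These offsets would be written to
    the secondary (positions) ring buffer in the proposed design."""
    starts = [0]
    in_quotes = False
    for i in range(len(data)):
        c = data[i:i + 1]
        if c == b'"':
            # Quote-state tracking: an escaped quote ("") toggles twice,
            # which leaves the state unchanged, so this stays correct.
            in_quotes = not in_quotes
        elif c == b'\n' and not in_quotes and i + 1 < len(data):
            starts.append(i + 1)
    return starts

def parse_chunk(data: bytes, starts: list[int],
                lo: int, hi: int) -> list[list[str]]:
    """Worker side: parse the tuples whose start positions are
    starts[lo:hi], as if the worker had claimed that chunk from the
    positions buffer. Each tuple ends where the next one begins."""
    rows = []
    for i in range(lo, hi):
        end = starts[i + 1] if i + 1 < len(starts) else len(data)
        line = data[starts[i]:end].decode()
        rows.extend(csv.reader(io.StringIO(line)))
    return rows

if __name__ == "__main__":
    # Three tuples; the second contains a newline inside a quoted field.
    data = b'a,b\n1,"x\ny"\n2,z\n'
    starts = find_tuple_starts(data)
    print(starts)                           # tuple start offsets
    print(parse_chunk(data, starts, 1, 3))  # worker claims tuples 1..2
```

In the real design the reader would advance its write position only up to the oldest "keep this data around" position across workers, so chunk data is never overwritten while a worker is still parsing it.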