On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapil...@gmail.com> wrote:
> This is something similar to what I had also in mind for this idea. I
> had thought of handing over complete chunk (64K or whatever we
> decide). The one thing that slightly bothers me is that we will add
> some additional overhead of copying to and from shared memory which
> was earlier from local process memory. And, the tokenization (finding
> line boundaries) would be serial. I think that tokenization should be
> a small part of the overall work we do during the copy operation, but
> will do some measurements to ascertain the same.

I don't think any extra copying is needed. The reader can
fread()/pq_copymsgbytes() directly into shared memory, and the workers
can run the CopyReadLineText() inner loop directly off the buffer in
shared memory.

As for the serial performance of tokenizing into lines, I really think
a SIMD-based approach will be fast enough for quite some time. I hacked
up the code in the simdcsv project to tokenize only on line endings, and
it was able to tokenize a CSV file with short lines at 8+ GB/s. Many
other bottlenecks will show up before this one becomes the limiting
factor.

Patch attached if you'd like to try that out.

Regards,
Ants Aasma
diff --git a/src/main.cpp b/src/main.cpp
index 9d33a85..2cf775c 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -185,7 +185,6 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
 #endif
         simd_input in = fill_input(buf+internal_idx);
         uint64_t quote_mask = find_quote_mask(in, prev_iter_inside_quote);
-        uint64_t sep = cmp_mask_against_input(in, ',');
 #ifdef CRLF
         uint64_t cr = cmp_mask_against_input(in, 0x0d);
         uint64_t cr_adjusted = (cr << 1) | prev_iter_cr_end;
@@ -195,7 +194,7 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
 #else
         uint64_t end = cmp_mask_against_input(in, 0x0a);
 #endif
-        fields[b] = (end | sep) & ~quote_mask;
+        fields[b] = (end) & ~quote_mask;
       }
       for(size_t b = 0; b < SIMDCSV_BUFFERSIZE; b++){
         size_t internal_idx = 64 * b + idx;
@@ -211,7 +210,6 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
 #endif
       simd_input in = fill_input(buf+idx);
       uint64_t quote_mask = find_quote_mask(in, prev_iter_inside_quote);
-      uint64_t sep = cmp_mask_against_input(in, ',');
 #ifdef CRLF
       uint64_t cr = cmp_mask_against_input(in, 0x0d);
       uint64_t cr_adjusted = (cr << 1) | prev_iter_cr_end;
@@ -226,7 +224,7 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
       // then outside the quotes with LF so it's OK to "and off"
       // the quoted bits here. Some other quote convention would
       // need to be thought about carefully
-      uint64_t field_sep = (end | sep) & ~quote_mask;
+      uint64_t field_sep = (end) & ~quote_mask;
       flatten_bits(base_ptr, base, idx, field_sep);
   }
 #undef SIMDCSV_BUFFERSIZE
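
For readers without the simdcsv sources at hand, the effect of the patched expression `(end) & ~quote_mask` can be illustrated with a scalar sketch. This is not the simdcsv code: the names `cmp_mask`, `prefix_xor` and `find_line_ends` are hypothetical, the per-byte loop and shift-based prefix XOR stand in for the SIMD compare and carry-less multiply, and only LF line endings with `"` as the quote character are assumed (no CRLF handling).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Scalar stand-in for a SIMD byte compare: bit i is set when block[i] == c.
static uint64_t cmp_mask(const unsigned char *block, size_t n, unsigned char c) {
    assert(n >= 1 && n <= 64);
    uint64_t mask = 0;
    for (size_t i = 0; i < n; i++)
        if (block[i] == c) mask |= uint64_t{1} << i;
    return mask;
}

// Prefix XOR: bit i of the result is the XOR of bits 0..i of x.
// Applied to the quote bits, it marks every byte that lies inside quotes.
static uint64_t prefix_xor(uint64_t x) {
    x ^= x << 1;  x ^= x << 2;  x ^= x << 4;
    x ^= x << 8;  x ^= x << 16; x ^= x << 32;
    return x;
}

// Returns the offsets of all LF bytes that are outside quoted fields.
std::vector<size_t> find_line_ends(const std::string &buf) {
    std::vector<size_t> ends;
    const unsigned char *p = reinterpret_cast<const unsigned char *>(buf.data());
    uint64_t prev_inside_quote = 0;  // all ones if the previous block ended inside quotes

    for (size_t idx = 0; idx < buf.size(); idx += 64) {
        size_t n = std::min<size_t>(64, buf.size() - idx);

        uint64_t quote_bits = cmp_mask(p + idx, n, '"');
        uint64_t quote_mask = prefix_xor(quote_bits) ^ prev_inside_quote;
        // Carry the in-quote state of the last bit into the next block.
        prev_inside_quote = (quote_mask >> 63) ? ~uint64_t{0} : uint64_t{0};

        uint64_t end = cmp_mask(p + idx, n, 0x0a);
        uint64_t line_sep = end & ~quote_mask;  // same shape as the patched expression

        for (size_t i = 0; i < n; i++)
            if ((line_sep >> i) & 1) ends.push_back(idx + i);
    }
    return ends;
}
```

Calling `find_line_ends` on a buffer gives the positions at which a reader could hand complete lines to workers; a newline embedded in a quoted field is skipped even when the quoted field straddles a 64-byte block boundary, because the in-quote state is carried between blocks.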