On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapil...@gmail.com> wrote:
> This is something similar to what I had also in mind for this idea.  I
> had thought of handing over complete chunk (64K or whatever we
> decide).  The one thing that slightly bothers me is that we will add
> some additional overhead of copying to and from shared memory which
> was earlier from local process memory.  And, the tokenization (finding
> line boundaries) would be serial.  I think that tokenization should be
> a small part of the overall work we do during the copy operation, but
> will do some measurements to ascertain the same.

I don't think any extra copying is needed. The reader can
fread()/pq_copymsgbytes() directly into shared memory, and the workers can
run the CopyReadLineText() inner loop directly on the buffer in shared memory.
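
To make the no-extra-copy point concrete, here is a rough sketch. It is
not PostgreSQL code: Chunk, fill_chunk() and count_newlines() are
hypothetical names, an ordinary struct stands in for a shared memory
segment, and counting newlines stands in for the real line-parsing loop.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical chunk layout. In PostgreSQL this would live in a shared
// memory segment; a plain struct stands in for it here.
struct Chunk {
    static constexpr size_t kCapacity = 64 * 1024;  // 64K, as suggested upthread
    size_t used = 0;                                // bytes currently valid
    char data[kCapacity];
};

// Reader side: fread() straight into the chunk, with no intermediate
// process-local buffer. Returns the number of bytes read.
size_t fill_chunk(std::FILE *in, Chunk &chunk) {
    chunk.used = std::fread(chunk.data, 1, Chunk::kCapacity, in);
    return chunk.used;
}

// Worker side: scan the bytes in place, again without copying them out.
size_t count_newlines(const Chunk &chunk) {
    size_t n = 0;
    for (size_t i = 0; i < chunk.used; i++)
        if (chunk.data[i] == '\n') n++;
    return n;
}
```

The only data movement is the one read from the input into the chunk,
which is the same read that previously targeted process-local memory.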

For serial tokenization into lines, I really think a SIMD-based approach
will be fast enough for quite some time. I hacked up the code in the
simdcsv project to tokenize only on line endings, and it was able to
tokenize a CSV file with short lines at 8+ GB/s. There will be many other
bottlenecks before this one becomes the limit. Patch attached if you'd
like to try it out.
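
For anyone who has not looked at simdcsv, the idea is roughly the
following. This is a portable scalar sketch, not the simdcsv code: the
real thing builds the masks with SIMD compares, while cmp_mask(),
prefix_xor(), ctz64() and find_line_ends() below are illustrative names
and plain loops stand in for the vector instructions. It handles LF line
endings only and treats '"' as the only quote character.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Bit i of the result is set when block[i] == c (scalar stand-in for a
// SIMD byte compare over a 64-byte block).
static uint64_t cmp_mask(const char *block, size_t n, char c) {
    uint64_t mask = 0;
    for (size_t i = 0; i < n; i++)
        if (block[i] == c) mask |= uint64_t{1} << i;
    return mask;
}

// Bit i of the result is the XOR of bits 0..i of x, so the bits between
// an opening quote and its closing quote come out set.
static uint64_t prefix_xor(uint64_t x) {
    x ^= x << 1;  x ^= x << 2;  x ^= x << 4;
    x ^= x << 8;  x ^= x << 16; x ^= x << 32;
    return x;
}

// Index of the lowest set bit; x must be non-zero.
static unsigned ctz64(uint64_t x) {
    unsigned n = 0;
    while (!(x & 1)) { x >>= 1; n++; }
    return n;
}

// Returns the offsets of all newlines that are outside quoted fields.
std::vector<size_t> find_line_ends(const std::string &buf) {
    std::vector<size_t> ends;
    uint64_t inside_quote = 0;  // all ones if the previous block ended inside quotes
    for (size_t idx = 0; idx < buf.size(); idx += 64) {
        size_t n = std::min<size_t>(64, buf.size() - idx);
        const char *block = buf.data() + idx;
        uint64_t quote_mask = prefix_xor(cmp_mask(block, n, '"')) ^ inside_quote;
        inside_quote = (quote_mask >> 63) ? ~uint64_t{0} : 0;
        // Same shape as the patched line: (end) & ~quote_mask
        uint64_t end = cmp_mask(block, n, '\n') & ~quote_mask;
        while (end) {
            ends.push_back(idx + ctz64(end));
            end &= end - 1;  // clear the lowest set bit
        }
    }
    return ends;
}
```

A doubled quote inside a quoted field toggles the mask twice, so it needs
no special handling. Any other quoting or escaping convention would need
more thought.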

Regards,
Ants Aasma
diff --git a/src/main.cpp b/src/main.cpp
index 9d33a85..2cf775c 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -185,7 +185,6 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
 #endif
         simd_input in = fill_input(buf+internal_idx);
         uint64_t quote_mask = find_quote_mask(in, prev_iter_inside_quote);
-        uint64_t sep = cmp_mask_against_input(in, ',');
 #ifdef CRLF
         uint64_t cr = cmp_mask_against_input(in, 0x0d);
         uint64_t cr_adjusted = (cr << 1) | prev_iter_cr_end;
@@ -195,7 +194,7 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
 #else
         uint64_t end = cmp_mask_against_input(in, 0x0a);
 #endif
-        fields[b] = (end | sep) & ~quote_mask;
+        fields[b] = (end) & ~quote_mask;
       }
       for(size_t b = 0; b < SIMDCSV_BUFFERSIZE; b++){
         size_t internal_idx = 64 * b + idx;
@@ -211,7 +210,6 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
 #endif
       simd_input in = fill_input(buf+idx);
       uint64_t quote_mask = find_quote_mask(in, prev_iter_inside_quote);
-      uint64_t sep = cmp_mask_against_input(in, ',');
 #ifdef CRLF
       uint64_t cr = cmp_mask_against_input(in, 0x0d);
       uint64_t cr_adjusted = (cr << 1) | prev_iter_cr_end;
@@ -226,7 +224,7 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
     // then outside the quotes with LF so it's OK to "and off"
     // the quoted bits here. Some other quote convention would
     // need to be thought about carefully
-      uint64_t field_sep = (end | sep) & ~quote_mask;
+      uint64_t field_sep = (end) & ~quote_mask;
       flatten_bits(base_ptr, base, idx, field_sep);
   }
 #undef SIMDCSV_BUFFERSIZE
