On Tue, 18 Feb 2020 at 12:20, Amit Kapila <[email protected]> wrote:
> This is something similar to what I had also in mind for this idea. I
> had thought of handing over a complete chunk (64K or whatever we
> decide). The one thing that slightly bothers me is that we will add
> some additional overhead of copying to and from shared memory, which
> was earlier done from local process memory. And, the tokenization (finding
> line boundaries) would be serial. I think that tokenization should be
> a small part of the overall work we do during the copy operation, but
> will do some measurements to ascertain the same.

I don't think any extra copying is needed. The reader can
fread()/pq_copymsgbytes() directly into shared memory, and the workers can run
the CopyReadLineText() inner loop directly off the buffer in shared memory.

For serial performance of tokenization into lines, I really think a SIMD-based
approach will be fast enough for quite some time. I hacked up the code in
the simdcsv project to tokenize only on line endings, and it was able to
tokenize a CSV file with short lines at 8+ GB/s. There will be many
other bottlenecks before this one becomes the limiting factor. Patch attached
if you'd like to try that out.

Regards,
Ants Aasma

diff --git a/src/main.cpp b/src/main.cpp
index 9d33a85..2cf775c 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -185,7 +185,6 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
#endif
simd_input in = fill_input(buf+internal_idx);
uint64_t quote_mask = find_quote_mask(in, prev_iter_inside_quote);
- uint64_t sep = cmp_mask_against_input(in, ',');
#ifdef CRLF
uint64_t cr = cmp_mask_against_input(in, 0x0d);
uint64_t cr_adjusted = (cr << 1) | prev_iter_cr_end;
@@ -195,7 +194,7 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
#else
uint64_t end = cmp_mask_against_input(in, 0x0a);
#endif
- fields[b] = (end | sep) & ~quote_mask;
+ fields[b] = (end) & ~quote_mask;
}
for(size_t b = 0; b < SIMDCSV_BUFFERSIZE; b++){
size_t internal_idx = 64 * b + idx;
@@ -211,7 +210,6 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
#endif
simd_input in = fill_input(buf+idx);
uint64_t quote_mask = find_quote_mask(in, prev_iter_inside_quote);
- uint64_t sep = cmp_mask_against_input(in, ',');
#ifdef CRLF
uint64_t cr = cmp_mask_against_input(in, 0x0d);
uint64_t cr_adjusted = (cr << 1) | prev_iter_cr_end;
@@ -226,7 +224,7 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
// then outside the quotes with LF so it's OK to "and off"
// the quoted bits here. Some other quote convention would
// need to be thought about carefully
- uint64_t field_sep = (end | sep) & ~quote_mask;
+ uint64_t field_sep = (end) & ~quote_mask;
flatten_bits(base_ptr, base, idx, field_sep);
}
#undef SIMDCSV_BUFFERSIZE