Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/13760 )
Change subject: KUDU-1938 Add support for CHAR/VARCHAR pt 1 ...................................................................... Patch Set 18: (11 comments) http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/common/partial_row.h File src/kudu/common/partial_row.h: http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/common/partial_row.h@150 PS17, Line 150: /// Set the binary/string/char/varchar value for a column by name, copying the : /// specified data im > This too (and in the other variant below). Done http://gerrit.cloudera.org:8080/#/c/13760/16/src/kudu/common/schema.h File src/kudu/common/schema.h: http://gerrit.cloudera.org:8080/#/c/13760/16/src/kudu/common/schema.h@123 PS16, Line 123: > Makes sense, we just need to make sure it's documented in the code where ap Done http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/util/char_util.h File src/kudu/util/char_util.h: http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/util/char_util.h@28 PS17, Line 28: namespace kudu { > FYI, slices are cheap enough that when we need to pass them read-only in ne Done http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/util/char_util.h@40 PS17, Line 40: s, the number of : // bytes in num_bytes. > This is confusing: if you want to store a byte quantity, name the parameter Renamed it, but it's necessary for the truncate/resize methods to know both the number of bytes and the UTF8 characters at the same offset. http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/util/char_util.h@41 PS17, Line 41: e max_utf8_length parameter determines the max : // number of UTF8 characters to be read to > Is this optimization worth making? Are we super concerned about counting to It's not really an optimization as much as a necessity to stop counting here so the num_bytes would align with num_utf8_chars http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/util/char_util.h@46 PS17, Line 46: void UTF8Length(const Slice val, size_t* num_utf8_chars, > Does 'length' indicate the new slice length? Is it a byte quantity or a UTF Changed it to max_utf8_length, should be clearer now. It doesn't need to be less than val.length(), we only use it as a maximum. The first n bytes correlating to the firs max_utf8_length characters of the val.data() are copied to a new location as it's used in KuduPartialRow::SetCharVarchar where it's expected to memcpy (i.e. SetString, SetBinary both call SetSliceCopy which uses memcpy as well). http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/util/char_util.h@50 PS17, Line 50: Slice UTF8Truncate(const Slice val, size_t max_utf8_length); > Why does this return an std::string? It returns an std::string because it's used in the scanner. The server serializes all binary type columns to a single string (indirect_data) and stores the offsets and sizes in the fixed width data (direct_data). Then this is passed through the wire and on the client side these offsets and sizes are converted into Slices pointing to locations within this single string. Due to this there's no way to resize the slices in the getters. Copying data to new slices is also impossible as the Slice doesn't manage the memory of the data it points to so it has to be freed up manually after the client no longer needs it. It should be possible to copy all data with the necessary padding after the message is received in `RewriteRowBlockPointers`, but that would require iterating through all rows at least one more time, appending every single binary data to a new faststring/std::string. First I opted to do the CHAR expansion in the sending end of the wire when the data is serialized in `SerializeRowBlock`. This way the receiving end doesn't have to do any extra work (in addition to saving on the memcpy and extra iteration on the rows, less complex code on the client side) and the extra spaces are still not persisted on the disk and the predicates are applied correctly too. The expansion applies only to CHARs anyway which are max 255 chars now as you suggested and the client side would've had to rewrite all binary data even if there's a single CHAR column in the projection. I then realized it would be much simpler to use std::string instead of Slice here so the allocated memory doesn't get freed up before the client could read it and it wouldn't break existing API as GetChar is a new method, so this is why this is also returning an std::string. I could dig out my previous approach from reflog if that's better. > What does it mean for the memory pointed to by 'val'? It leaves it unchanged. > Is 'length' a byte quantity or a UTF8 character quantity? It's UTF8, changed the parameter name to reflect this. > Can it be less than val.length()? It can, it would truncate in this case. Hopefully this is also clear now from the name of the parameter. > If the string is expanded, is the excess padded? Yes, added a comment to explain it. http://gerrit.cloudera.org:8080/#/c/13760/16/src/kudu/util/char_util.cc File src/kudu/util/char_util.cc: http://gerrit.cloudera.org:8080/#/c/13760/16/src/kudu/util/char_util.cc@27 PS16, Line 27: > Does this duplicate functionality from gutil/utf? There is a similar method [utfnlen(const char*, long n)], but the n represents the maximum number of bytes instead of maximum number of UTF8 characters + it doesn't set char_length (renamed to num_bytes) which we need in resize/truncate http://gerrit.cloudera.org:8080/#/c/13760/16/src/kudu/util/char_util.cc@46 PS16, Line 46: size_t num_bytes = 0; > It's a little confusing as to why we need three different kinds of length h Removed this part from here as I realized we don't need this, but explained it in the resize_to_str (renamed to UTF8Resize) + renamed the variables as well to make more sense. http://gerrit.cloudera.org:8080/#/c/13760/16/src/kudu/util/char_util.cc@66 PS16, Line 66: > Add a "using std::string" at the top and drop this. Done http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/util/char_util.cc File src/kudu/util/char_util.cc: PS17: > Looks like you missed the review comments in this file. Done -- To view, visit http://gerrit.cloudera.org:8080/13760 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I998982dba93831db91c43a97ce30d3e68c2a4a54 Gerrit-Change-Number: 13760 Gerrit-PatchSet: 18 Gerrit-Owner: Attila Bukor <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Attila Bukor <[email protected]> Gerrit-Reviewer: Grant Henke <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: Will Berkeley <[email protected]> Gerrit-Comment-Date: Tue, 16 Jul 2019 05:13:38 +0000 Gerrit-HasComments: Yes
