Attila Bukor has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13760 )

Change subject: KUDU-1938 Add support for CHAR/VARCHAR pt 1
......................................................................


Patch Set 18:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/common/partial_row.h
File src/kudu/common/partial_row.h:

http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/common/partial_row.h@150
PS17, Line 150:   /// Set the binary/string/char/varchar value for a column by 
name, copying the
              :   /// specified data im
> This too (and in the other variant below).
Done


http://gerrit.cloudera.org:8080/#/c/13760/16/src/kudu/common/schema.h
File src/kudu/common/schema.h:

http://gerrit.cloudera.org:8080/#/c/13760/16/src/kudu/common/schema.h@123
PS16, Line 123:
> Makes sense, we just need to make sure it's documented in the code where ap
Done


http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/util/char_util.h
File src/kudu/util/char_util.h:

http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/util/char_util.h@28
PS17, Line 28: namespace kudu {
> FYI, slices are cheap enough that when we need to pass them read-only in ne
Done


http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/util/char_util.h@40
PS17, Line 40: s, the number of
             :   // bytes in num_bytes.
> This is confusing: if you want to store a byte quantity, name the parameter
Renamed it, but it's necessary for the truncate/resize methods to know both the 
number of bytes and the UTF8 characters at the same offset.


http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/util/char_util.h@41
PS17, Line 41: e max_utf8_length parameter determines the max
             :   // number of UTF8 characters to be read to
> Is this optimization worth making? Are we super concerned about counting to
It's not really an optimization as much as a necessity to stop counting here so 
the num_bytes would align with num_utf8_chars


http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/util/char_util.h@46
PS17, Line 46:   void UTF8Length(const Slice val, size_t* num_utf8_chars,
> Does 'length' indicate the new slice length? Is it a byte quantity or a UTF
Changed it to max_utf8_length, should be clearer now. It doesn't need to be 
less than val.length(), we only use it as a maximum. The first n bytes 
correlating to the firs max_utf8_length characters of the val.data() are copied 
to a new location as it's used in KuduPartialRow::SetCharVarchar where it's 
expected to memcpy (i.e. SetString, SetBinary both call SetSliceCopy which uses 
memcpy as well).


http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/util/char_util.h@50
PS17, Line 50:   Slice UTF8Truncate(const Slice val, size_t max_utf8_length);
> Why does this return an std::string?

It returns an std::string because it's used in the scanner. The server 
serializes all binary type columns to a single string (indirect_data) and 
stores the offsets and sizes in the fixed width data (direct_data). Then this 
is passed through the wire and on the client side these offsets and sizes are 
converted into Slices pointing to locations within this single string.

Due to this there's no way to resize the slices in the getters. Copying data to 
new slices is also impossible as the Slice doesn't manage the memory of the 
data it points to so it has to be freed up manually after the client no longer 
needs it.

It should be possible to copy all data with the necessary padding after the 
message is received in `RewriteRowBlockPointers`, but that would require 
iterating through all rows at least one more time, appending every single 
binary data to a new faststring/std::string.

First I opted to do the CHAR expansion in the sending end of the wire when the 
data is serialized in `SerializeRowBlock`. This way the receiving end doesn't 
have to do any extra work (in addition to saving on the memcpy and extra 
iteration on the rows, less complex code on the client side) and the extra 
spaces are still not persisted on the disk and the predicates are applied 
correctly too. The expansion applies only to CHARs anyway which are max 255 
chars now as you suggested and the client side would've had to rewrite all 
binary data even if there's a single CHAR column in the projection.

I then realized it would be much simpler to use std::string instead of Slice 
here so the allocated memory doesn't get freed up before the client could read 
it and it wouldn't break existing API as GetChar is a new method, so this is 
why this is also returning an std::string. I could dig out my previous approach 
from reflog if that's better.

> What does it mean for the memory pointed to by 'val'?

It leaves it unchanged.

> Is 'length' a byte quantity or a UTF8 character quantity?

It's UTF8, changed the parameter name to reflect this.

> Can it be less than val.length()?

It can, it would truncate in this case. Hopefully this is also clear now from 
the name of the parameter.

> If the string is expanded, is the excess padded?

Yes, added a comment to explain it.


http://gerrit.cloudera.org:8080/#/c/13760/16/src/kudu/util/char_util.cc
File src/kudu/util/char_util.cc:

http://gerrit.cloudera.org:8080/#/c/13760/16/src/kudu/util/char_util.cc@27
PS16, Line 27:
> Does this duplicate functionality from gutil/utf?
There is a similar method [utfnlen(const char*, long n)], but the n represents 
the maximum number of bytes instead of maximum number of UTF8 characters + it 
doesn't set char_length (renamed to num_bytes) which we need in resize/truncate


http://gerrit.cloudera.org:8080/#/c/13760/16/src/kudu/util/char_util.cc@46
PS16, Line 46:   size_t num_bytes = 0;
> It's a little confusing as to why we need three different kinds of length h
Removed this part from here as I realized we don't need this, but explained it 
in the resize_to_str (renamed to UTF8Resize) + renamed the variables as well to 
make more sense.


http://gerrit.cloudera.org:8080/#/c/13760/16/src/kudu/util/char_util.cc@66
PS16, Line 66:
> Add a "using std::string" at the top and drop this.
Done


http://gerrit.cloudera.org:8080/#/c/13760/17/src/kudu/util/char_util.cc
File src/kudu/util/char_util.cc:

PS17:
> Looks like you missed the review comments in this file.
Done



--
To view, visit http://gerrit.cloudera.org:8080/13760
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I998982dba93831db91c43a97ce30d3e68c2a4a54
Gerrit-Change-Number: 13760
Gerrit-PatchSet: 18
Gerrit-Owner: Attila Bukor <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Grant Henke <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Tue, 16 Jul 2019 05:13:38 +0000
Gerrit-HasComments: Yes

Reply via email to