Todd Lipcon has posted comments on this change.

Change subject: KUDU-1398 CFile index blocks can store shortest separating 
prefix
......................................................................


Patch Set 4:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/3304/4//COMMIT_MSG
Commit Message:

PS4, Line 17: compatibility with bloomfile
what's this mean?


http://gerrit.cloudera.org:8080/#/c/3304/4/src/kudu/cfile/binary_dict_block.cc
File src/kudu/cfile/binary_dict_block.cc:

Line 131:     last_key_.assign_copy(last->data(), last->size());
instead of copying this every time, could we use GetLast on the code word 
builder, then look it up in our dictionary? What you have here will probably 
hurt write performance


http://gerrit.cloudera.org:8080/#/c/3304/4/src/kudu/cfile/binary_plain_block.cc
File src/kudu/cfile/binary_plain_block.cc:

Line 132: Status BinaryPlainBlockBuilder::GetLastKey(void *key_void) const {
can we refactor this to share code with GetFirstKey? eg a GetKeyAtIndex(int 
idx)?


http://gerrit.cloudera.org:8080/#/c/3304/4/src/kudu/cfile/cfile_util.cc
File src/kudu/cfile/cfile_util.cc:

Line 126: void SeparatingKey(const faststring& left, const faststring& right, 
faststring *target) {
can you add some unit tests for this function? (eg the examples from the commit 
message)


Line 127:   DCHECK(memcmp(left.data(), right.data(), min(left.size(), 
right.size())) <=
I think DCHECK_LE(Slice(left).compare(Slice(right), 0); is probably more 
readable. Especially if you change this function to take slices as arguments


Line 130:   if (left.size() == 0) {
.empty()


Line 131:     // special case for first block: use the whole key
not sure why this is the case


Line 135:     target->assign_copy(right.data(), cpl == right.size() ? cpl : cpl 
+ 1);
hm, this is a slightly different algorithm than the one used in 
gutil/strings/util.h (FindShortestSeparator). eg their example:

  // FindShortestSeparator("foobar", "foxhunt", &sep) => sep == "fop"

you would return 'fox', right? Either one is the same length, but I guess 
theirs ends up tighter to the left bound whereas yours ends up somewhere in the 
middle. Can you think of any advantages/disadvantages between the two? As far 
as I can think, either is correct, but maybe I'm missing something.


http://gerrit.cloudera.org:8080/#/c/3304/4/src/kudu/cfile/cfile_util.h
File src/kudu/cfile/cfile_util.h:

Line 63:   bool deltafile;
I think I'd prefer this be called something like 'optimize_index_keys' or 
something, rather than specifically referring to whether it's a deltafile.

Another option is to have an int 'preserve_index_key_prefix_length' or 
somesuch, where for delta keys we pass the maximum possible encoded length of a 
delta key. This ensures that we can still decode it, but we might still be able 
to truncate the value (since the delta entries themselves can be arbitrarily 
long with user data)


Line 110: // Computes the shortest key satisfying left < key < right and places 
it into target.
should clarify whether 'target' may be the same faststring as 'left' or 'right' 
or not


Line 111: void SeparatingKey(const faststring& left, const faststring& right, 
faststring *target);
rename to GetSeparatingKey


PS4, Line 111: const faststring& left, const faststring& right
may make sense to use StringPiece or Slice here rather than specifically 
faststring


http://gerrit.cloudera.org:8080/#/c/3304/4/src/kudu/cfile/cfile_writer.cc
File src/kudu/cfile/cfile_writer.cc:

Line 141:     last_key_.clear();
no need (it's initialized to clear)


Line 437:       SeparatingKey(last_key_, tmp_buf_, &sep_buf);
maybe make SeparatingKey actually just shorten tmp_buf_ in place to avoid the 
extra allocation here


http://gerrit.cloudera.org:8080/#/c/3304/4/src/kudu/cfile/rle_block.h
File src/kudu/cfile/rle_block.h:

Line 265:     if (count_ > 0) {
usually we put the error case first


-- 
To view, visit http://gerrit.cloudera.org:8080/3304
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I68ae9146fabd4a19b17d103d118d2d60e28bb315
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley <wdberke...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Will Berkeley <wdberke...@gmail.com>
Gerrit-HasComments: Yes

Reply via email to