Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12205 )

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in 
MergeIterator
......................................................................


Patch Set 3:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/12205/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12205/3//COMMIT_MSG@19
PS3, Line 19: Because of this new behavior, while the MergeIterator will 
deduplicate
            : deleted rows if they are included in the result set, the 
UnionIterator
            : will not deduplicate them and instead return all instances found.
> I might reword this to state that it's not possible to efficiently support
Done


http://gerrit.cloudera.org:8080/#/c/12205/3//COMMIT_MSG@22
PS3, Line 22:
            : One tangentially-related change in this patch is that TestMerge in
            : generic_iterators-test.cc was modified to no longer generate 
duplicate
            : non-deleted keys for merge testing
> This makes sense given what you're trying to do, but it's a bit of a bummer
I agree, and if there was a lot of value associated with making it continue to 
work the same way then I would have done it, but there's no customer for it.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators-test.cc
File src/kudu/common/generic_iterators-test.cc:

PS3:
> Could you add a test case for include_deleted_rows=true that uses these sim
Hmm, I'll look into writing a test case using the VectorIterators. I originally 
looked into it and it didn't seem straightforward... but I don't recall exactly 
why right now.

And sure, I'll work on measuring the perf impact of this patch.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators-test.cc@232
PS3, Line 232:       seen.insert(potential);
> Nit: InsertOrDie. Or EmplaceOrDie (if it works).
Done


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.h
File src/kudu/common/generic_iterators.h:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.h@38
PS3, Line 38: schema
> Nit: schemas (since this is a requirement for every sub-iterator).
Done


http://gerrit.cloudera.org:8080/#/c/12205/1/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:

http://gerrit.cloudera.org:8080/#/c/12205/1/src/kudu/common/generic_iterators.cc@350
PS1, Line 350:
> This can be hoisted out of the outer loop, and maybe used to control whethe
Yes I agree, maybe we can do this in Init()


http://gerrit.cloudera.org:8080/#/c/12205/1/src/kudu/common/generic_iterators.cc@355
PS1, Line 355:       remove_if(states_.begin(), states_.end(), [] (const 
unique_ptr<MergeIterState>& iter) {
> This can overflow the RowBlock. We need to avoid doing that, but also make
Good point, thanks


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@300
PS3, Line 300:     : opts_(opts),
> std::move(opts)
clang-tidy complains when std::move()ing 'opts' because it's a trivial struct 
that reduces to a single bool, so that could be a pessimizing move. Once we add 
another field to the struct, clang-tidy will probably start to complain that we 
should be move()ing it.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@419
PS3, Line 419:   vector<MergeIterState*> smallest(states_.size());
> Can we micro-optimize the !include_deleted_rows case such that the operatio
I'll work on measuring the impact of these changes and go from there, since it 
looks to me like there are mostly only ugly options if that is an optimization 
we need to do.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@432
PS3, Line 432:     for (const auto& iter : states_) {
> There's a non-obvious state machine in here; could you document it?
Done


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@439
PS3, Line 439:         smallest = { iter.get() };
> Is this is a copy assignment? For the common case where there are no duplic
Good observation. I think smallest.resize(1, iter.get()) would be more 
efficient than whatever this might be doing.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@441
PS3, Line 441:         smallest.push_back(iter.get());
> Nit: emplace_back for new code.
Is there a reason to prefer emplace_back() over push_back() in general? I think 
it used to be in the GSG but it's since been removed. I find it confusing to 
conflate push_back() and emplace_back(), since the beauty of emplace is that it 
acts as an in-place constructor taking constructor arguments, whereas the 
vec.push_back(ptr) form seems more semantically appropriate in this case, where 
we are calling the trivial copy constructor of a primitive type. Even if it 
wasn't a primitive type, it would seem more descriptive and I believe just as 
efficient to simply call vec.push_back(std::move(foo)) if trying to use the 
move constructor and avoid temporary objects from being constructed.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@465
PS3, Line 465:           // We found the single live instance of the row.
> In DEBUG builds, should we verify that any remaining instances are ghosts?
Done


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@479
PS3, Line 479:       RETURN_NOT_OK(CopyRow(row_to_return->next_row(), &dst_row, 
dst->arena()));
> I think this can be factored out of both sides of the if statement if you d
I don't understand this comment.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@487
PS3, Line 487:     // Remove exhausted sub-iterators from the list we merge 
from.
> Any reason not to combine this loop with the one that advances the sub-iter
Yeah... we can do better here. I'll update this.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@504
PS3, Line 504:
             :   // Maintain a record of the number of rows actually copied to 
the destination
             :   // RowBlock.
> I think this is mostly self-explanatory from the code. What would be intere
Good call, done.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@536
PS3, Line 536:   return unique_ptr<RowwiseIterator>(new MergeIterator(opts, 
std::move(iters)));
> std::move(opts)
see above


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/diff_scan-test.cc
File src/kudu/tablet/diff_scan-test.cc:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/diff_scan-test.cc@61
PS3, Line 61:   ASSERT_TRUE(order_mode == UNORDERED || order_mode == ORDERED)
            :       << "unknown order mode: " << OrderMode_Name(order_mode);
> Doesn't seem necessary to me; the list of values is just a few lines above.
Done


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/rowset.cc
File src/kudu/tablet/rowset.cc:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/rowset.cc@105
PS3, Line 105:     case ORDERED: {
> Why do you need the new scope here? Same for tablet.cc.
This must have been a spurious addition once needed before a refactor. I'll 
remove it in both places.



--
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 3
Gerrit-Owner: Mike Percy <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Comment-Date: Thu, 28 Feb 2019 02:58:15 +0000
Gerrit-HasComments: Yes

Reply via email to