Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12205 )
Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator ...................................................................... Patch Set 3: (19 comments) http://gerrit.cloudera.org:8080/#/c/12205/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12205/3//COMMIT_MSG@19 PS3, Line 19: Because of this new behavior, while the MergeIterator will deduplicate : deleted rows if they are included in the result set, the UnionIterator : will not deduplicate them and instead return all instances found. > I might reword this to state that it's not possible to efficiently support Done http://gerrit.cloudera.org:8080/#/c/12205/3//COMMIT_MSG@22 PS3, Line 22: : One tangentially-related change in this patch is that TestMerge in : generic_iterators-test.cc was modified to no longer generate duplicate : non-deleted keys for merge testing > This makes sense given what you're trying to do, but it's a bit of a bummer I agree, and if there was a lot of value associated with making it continue to work the same way then I would have done it, but there's no customer for it. http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators-test.cc File src/kudu/common/generic_iterators-test.cc: PS3: > Could you add a test case for include_deleted_rows=true that uses these sim Hmm, I'll look into writing a test case using the VectorIterators. I originally looked into it and it didn't seem straightforward... but I don't recall exactly why right now. And sure, I'll work on measuring the perf impact of this patch. http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators-test.cc@232 PS3, Line 232: seen.insert(potential); > Nit: InsertOrDie. Or EmplaceOrDie (if it works). Done http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.h File src/kudu/common/generic_iterators.h: http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.h@38 PS3, Line 38: schema > Nit: schemas (since this is a requirement for every sub-iterator). Done http://gerrit.cloudera.org:8080/#/c/12205/1/src/kudu/common/generic_iterators.cc File src/kudu/common/generic_iterators.cc: http://gerrit.cloudera.org:8080/#/c/12205/1/src/kudu/common/generic_iterators.cc@350 PS1, Line 350: > This can be hoisted out of the outer loop, and maybe used to control whethe Yes I agree, maybe we can do this in Init() http://gerrit.cloudera.org:8080/#/c/12205/1/src/kudu/common/generic_iterators.cc@355 PS1, Line 355: remove_if(states_.begin(), states_.end(), [] (const unique_ptr<MergeIterState>& iter) { > This can overflow the RowBlock. We need to avoid doing that, but also make Good point, thanks http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc File src/kudu/common/generic_iterators.cc: http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@300 PS3, Line 300: : opts_(opts), > std::move(opts) clang-tidy complains when std::move()ing 'opts' because it's a trivial struct that reduces to a single bool, so that could be a pessimizing move. Once we add another field to the struct, clang-tidy will probably start to complain that we should be move()ing it. http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@419 PS3, Line 419: vector<MergeIterState*> smallest(states_.size()); > Can we micro-optimize the !include_deleted_rows case such that the operatio I'll work on measuring the impact of these changes and go from there, since it looks to me like there are mostly only ugly options if that is an optimization we need to do. http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@432 PS3, Line 432: for (const auto& iter : states_) { > There's a non-obvious state machine in here; could you document it? Done http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@439 PS3, Line 439: smallest = { iter.get() }; > Is this is a copy assignment? For the common case where there are no duplic Good observation. I think smallest.resize(1, iter.get()) would be more efficient than whatever this might be doing. http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@441 PS3, Line 441: smallest.push_back(iter.get()); > Nit: emplace_back for new code. Is there a reason to prefer emplace_back() over push_back() in general? I think it used to be in the GSG but it's since been removed. I find it confusing to conflate push_back() and emplace_back(), since the beauty of emplace is that it acts as an in-place constructor taking constructor arguments, whereas the vec.push_back(ptr) form seems more semantically appropriate in this case, where we are calling the trivial copy constructor of a primitive type. Even if it wasn't a primitive type, it would seem more descriptive and I believe just as efficient to simply call vec.push_back(std::move(foo)) if trying to use the move constructor and avoid temporary objects from being constructed. http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@465 PS3, Line 465: // We found the single live instance of the row. > In DEBUG builds, should we verify that any remaining instances are ghosts? Done http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@479 PS3, Line 479: RETURN_NOT_OK(CopyRow(row_to_return->next_row(), &dst_row, dst->arena())); > I think this can be factored out of both sides of the if statement if you d I don't understand this comment. http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@487 PS3, Line 487: // Remove exhausted sub-iterators from the list we merge from. > Any reason not to combine this loop with the one that advances the sub-iter Yeah... we can do better here. I'll update this. http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@504 PS3, Line 504: : // Maintain a record of the number of rows actually copied to the destination : // RowBlock. > I think this is mostly self-explanatory from the code. What would be intere Good call, done. http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@536 PS3, Line 536: return unique_ptr<RowwiseIterator>(new MergeIterator(opts, std::move(iters))); > std::move(opts) see above http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/diff_scan-test.cc File src/kudu/tablet/diff_scan-test.cc: http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/diff_scan-test.cc@61 PS3, Line 61: ASSERT_TRUE(order_mode == UNORDERED || order_mode == ORDERED) : << "unknown order mode: " << OrderMode_Name(order_mode); > Doesn't seem necessary to me; the list of values is just a few lines above. Done http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/rowset.cc File src/kudu/tablet/rowset.cc: http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/rowset.cc@105 PS3, Line 105: case ORDERED: { > Why do you need the new scope here? Same for tablet.cc. This must have been a spurious addition once needed before a refactor. I'll remove it in both places. -- To view, visit http://gerrit.cloudera.org:8080/12205 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee Gerrit-Change-Number: 12205 Gerrit-PatchSet: 3 Gerrit-Owner: Mike Percy <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Comment-Date: Thu, 28 Feb 2019 02:58:15 +0000 Gerrit-HasComments: Yes
