Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12205 )
Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in MergeIterator ...................................................................... Patch Set 4: (4 comments) http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc File src/kudu/common/generic_iterators.cc: http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@441 PS3, Line 441: // If we have a candidate for smallest row, compare it against the > Is there a reason to prefer emplace_back() over push_back() in general? I t My take is that if I use emplace_back(), I don't have to think about whether I can move or copy; it allows both, and without any overhead for copying. So if we consistently use emplace_back(), that's just one less thing to think about. That's why I started pointing it out in my code reviews (for the past year or two). IIUC, push_back(std::move(foo)) always makes a copy of 'foo' but misleads readers into thinking that 'foo' is being moved. Hence the thought to avoid it in new code. http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@479 PS3, Line 479: // We found the single live instance of the row. > I don't understand this comment. I think you can deduplicate the two calls to CopyRow (one here and the other on L453) into one call just after if (!opts_.include_deleted_rows) { ... } else { ... } http://gerrit.cloudera.org:8080/#/c/12205/4/src/kudu/common/generic_iterators.cc File src/kudu/common/generic_iterators.cc: http://gerrit.cloudera.org:8080/#/c/12205/4/src/kudu/common/generic_iterators.cc@474 PS4, Line 474: int live_rows_found = 0; Could you make sure this doesn't trigger a not-used warning in RELEASE builds? http://gerrit.cloudera.org:8080/#/c/12205/4/src/kudu/common/generic_iterators.cc@503 PS4, Line 503: unordered_set<MergeIterState*> exhausted; I think this can be a simple vector; aren't we guaranteed that there are no duplicates _within_ a given sub-iterator? Regardless, I still think it'd be better to avoid this intermediate step and combine the two loops. I get that you want to acquire states_lock_ and do states_.erase() just once, but consider that if there were duplicates for the returned row, the number of duplicates is probably very low, so the overhead of multiple lock acquisition/vector erases may be outweighed by the allocation of the set/vector here. -- To view, visit http://gerrit.cloudera.org:8080/12205 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee Gerrit-Change-Number: 12205 Gerrit-PatchSet: 4 Gerrit-Owner: Mike Percy <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Comment-Date: Thu, 28 Feb 2019 06:11:56 +0000 Gerrit-HasComments: Yes
