Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/17152 )
Change subject: KUDU-3254 fix bug in meta-cache exposed by KUDU-1802 ...................................................................... Patch Set 6: (5 comments) > (2 comments) > > Thanks for the fix and test. This test should also be added to > master right? > > It looks like there are test failures due to unrelated python infra > issues on the branch. Yep, I'm planning to back-port the test scenario to the main branch from 1.13.x: the 'fix' is already there. http://gerrit.cloudera.org:8080/#/c/17152/6/src/kudu/client/meta_cache.cc File src/kudu/client/meta_cache.cc: http://gerrit.cloudera.org:8080/#/c/17152/6/src/kudu/client/meta_cache.cc@932 PS6, Line 932: if (remote.get() != nullptr) { > Ah, so this issue only exists if a range partition was dropped and then rep Yep: in pre-KUDU-1802 case, the crash could only happen if the range partition was dropped and then replaced with the same range again. After KUDU-1802, the crash happened even if a range was just dropped, when the client instance was fed with 'stale' scan tokens. In the test scenario, the RANGE_DROPPED case is the essence of the culprit, the rest are just variations on top. I added sub-scenarios for larger and smaller ranges in there just to explicitly document how that case works w.r.t. discovering the corresponding 'mapped' ranges, so it's clear what rows are read given a set of 'stale' scan tokens. http://gerrit.cloudera.org:8080/#/c/17152/6/src/kudu/client/meta_cache.cc@943 PS6, Line 943: auto* entry = FindOrNull(tablets_by_key, tablet_lower_bound); > I generally disagree that we shouldn't use FindOrDie in production code. It Yep, I agree: FindOrDie() should be used when there isn't a way to gracefully handle the situation otherwise. For example, in many cases continuing further could damage the integrity of the data or even worse -- that's definitely a case to use FindOrDie(). In all other cases, it makes sense to use FindOrNull() or alike, sure. FWIW, I think it makes sense to revise the usage of FindOrDie() in the code since I suspect that in many cases FindOrDie() is used only because it was easier to write it like that and not handle the error otherwise, even if it were possible to handle the error condition gracefully. http://gerrit.cloudera.org:8080/#/c/17152/6/src/kudu/client/meta_cache.cc@955 PS6, Line 955: > nit: replace with spaces Done http://gerrit.cloudera.org:8080/#/c/17152/6/src/kudu/client/scan_token-test.cc File src/kudu/client/scan_token-test.cc: http://gerrit.cloudera.org:8080/#/c/17152/6/src/kudu/client/scan_token-test.cc@950 PS6, Line 950: > nit: spacing Done http://gerrit.cloudera.org:8080/#/c/17152/6/src/kudu/client/scan_token-test.cc@1052 PS6, Line 1052: [[fallthrough]]; > nit: maybe use FALLTHROUGH_INTENDED? Or maybe we should deprecate that in f OK, I guess keeping it uniform makes more sense: replaced with FALLTHROUGH_INTENDED. I think I can post a separate patch replacing FALLTHROUGH_INTENDED with [[fallthrough]]. -- To view, visit http://gerrit.cloudera.org:8080/17152 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: branch-1.13.x Gerrit-MessageType: comment Gerrit-Change-Id: I5b8370290c13b1e496f461ed5bc2e0193bdf4b19 Gerrit-Change-Number: 17152 Gerrit-PatchSet: 6 Gerrit-Owner: Alexey Serbin <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Grant Henke <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Sun, 07 Mar 2021 07:18:25 +0000 Gerrit-HasComments: Yes
