Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11249 )
Change subject: KUDU-2469 pt 2: fail replicas on CFile corruption ...................................................................... Patch Set 9: (13 comments) http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/cfile/bloomfile.cc File src/kudu/cfile/bloomfile.cc: http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/cfile/bloomfile.cc@257 PS9, Line 257: return Status::Corruption("bloom file is compressed (compression not supported)", > I'm also wondering why compression or a missing value index are grounds for Yeah, I agree with returning some other error, or just not handling the corruption since it's not indicative of a disk-related corruption. It might crash, depending on the operation (e.g. if this is done in the context of Flush, it will, if on a scan, compaction, it won't). Regardless, failing the tablet seems inappropriate here. http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/cfile/cfile_reader.cc File src/kudu/cfile/cfile_reader.cc: http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/cfile/cfile_reader.cc@185 PS9, Line 185: RETURN_NOT_OK(ReadAndParseFooter(io_context)); > I think if we eliminate all the Corruption status that should actually be s Done http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/cfile/cfile_reader.cc@234 PS9, Line 234: .CloneAndPrepend("failed to parse CFile pre-header"); > Shouldn't the CloneAndPrepend be conditioned on !s.OK()? It could be, though CloneAndPrepend does that check already. http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/cfile/cfile_reader.cc@328 PS9, Line 328: RETURN_NOT_OK(VerifyChecksum(io_context, slices, checksum)); > Can we handle the checksum corruption here with RETURN_NOT_OK_HANDLE_CORRUP Done http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/fs/io_context.h File src/kudu/fs/io_context.h: http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/fs/io_context.h@30 PS9, Line 30: user > s/user/module/ based on the text in the examples. Done http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/integration-tests/disk_failure-itest.cc File src/kudu/integration-tests/disk_failure-itest.cc: http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/integration-tests/disk_failure-itest.cc@217 PS9, Line 217: ::testing::Values(ErrorType::CFILE_CORRUPTION)); > Shouldn't this parameterize on DISK_FAILURE too? Whoops yes http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/integration-tests/disk_failure-itest.cc@240 PS9, Line 240: // Wait a bit for some errors to occur. > Can we ASSERT_EVENTUALLY() on something here, to ensure that we only run th Done http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/integration-tests/disk_failure-itest.cc@250 PS9, Line 250: // Set up a workload that does both reads and writes. > But you're setting the number of write threads to 0... Done http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/integration-tests/disk_failure-itest.cc@272 PS9, Line 272: // Wait a bit for some errors to occur. > Same feedback about waiting for an error here. Done http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/tablet/deltafile.cc File src/kudu/tablet/deltafile.cc: http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/tablet/deltafile.cc@274 PS9, Line 274: RETURN_NOT_OK(reader_->Init(io_context)); > Same comment in this file as the ones in the bloomfile code. Can we depend Done http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/tablet/deltafile.cc@278 PS9, Line 278: return Status::Corruption("file does not have a value index!"); > Again, seems more like NotSupported than Corruption. Done http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/tablet/deltafile.cc@291 PS9, Line 291: return Status::Corruption("missing delta stats from the delta file metadata"); > Same here. Done http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/tserver/tablet_server-test.cc File src/kudu/tserver/tablet_server-test.cc: http://gerrit.cloudera.org:8080/#/c/11249/9/src/kudu/tserver/tablet_server-test.cc@1676 PS9, Line 1676: auto log_req = MakeScopedCleanup([&] { : LOG(INFO) << SecureDebugString(req); : }); > Does SCOPED_TRACE not work in here? Indeed, seems ASSERT_EVENTUALLY and SCOPED_TRACE don't like each other. It would break the build as: ../../src/kudu/tserver/tablet_server-test.cc:1674:77: note: ‘testing::internal::ScopedTrace gtest_trace_1683’ previously declared here I'll just tweak this a bit, since this is kind of ugly. -- To view, visit http://gerrit.cloudera.org:8080/11249 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I63d541443bc68c83fd0ca6d51315143fee04d50f Gerrit-Change-Number: 11249 Gerrit-PatchSet: 9 Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Grant Henke <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Comment-Date: Sat, 01 Sep 2018 07:11:29 +0000 Gerrit-HasComments: Yes
