Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13733 )
Change subject: Modify table scan tool to collect and surface errors instead of crashing ...................................................................... Patch Set 2: (11 comments) Looks pretty good! http://gerrit.cloudera.org:8080/#/c/13733/1//COMMIT_MSG Commit Message: PS1: nit: For commit messages, we try to follow the guidance outlined here: https://chris.beams.io/posts/git-commit/ http://gerrit.cloudera.org:8080/#/c/13733/1//COMMIT_MSG@7 PS1, Line 7: Modify table scan tool to collect and surface errors instead of crashing nit: Could you prefix this with KUDU-2851 so it's easy to associate this patch with the ticket? e.g. KUDU-2851: modify table scan tool to collect and surface errors instead of crashing http://gerrit.cloudera.org:8080/#/c/13733/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/13733/2//COMMIT_MSG@9 PS2, Line 9: Initialized a Status for each thread in StartWork(). : Replaced CHECKs in ScanData() with return of bad Status. : ScanTask() or CopyTask() catch bad Statuses from ScanData() in : out-parameter back to StartWork(). : StartWork() logs and returns the first bad Status from the thread pool. nit: Since this is a list of stuff, could you maybe add bullets (e.g. *s or -s) before each item, and adjust the spacing so it's a little easier to read? E.g. ... - ScanTask() or Copy... out-parameter back... - StartWork() logs a... Also, these are good details, but you probably don't _need_ to go _that_ in-depth to the point where you're calling out specific functions. A high-level description of the approach should suffice. http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/kudu-tool-test.cc File src/kudu/tools/kudu-tool-test.cc: http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/kudu-tool-test.cc@5004 PS1, Line 5004: // Now shut down the tablet servers so the scans cannot proceed. : // Upon running the scan tool, we should get a TimedOut status. : NO_FATALS(cluster_->ShutdownNodes(cluster::ClusterNodes::TS_ONLY)); The goal here was to make sure that our scan hit an error, and shutting down the Tablet Servers was a means to that end. An side effect of that decision, as you saw, is that we have to wait for our scan to time out, which can take a while. Instead, how about we use a different way to inject errors to the scan? Let's instead set the following flags on the Tablet Servers to inject errors: - tserver_enforce_access_control=true - tserver_inject_invalid_authz_token_ratio=1.0 Rather than having to wait for a timeout, this will have the scans make their ways to the Tablet Servers, and immediately be rejected because they're not authorized. An example of setting flags on a cluster can be found in integration-tests/disk_failure-itest.cc at L203. http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.h File src/kudu/tools/table_scanner.h: http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.h@84 PS1, Line 84: const std::function<void(const kudu::client::KuduScanBatch& batch)>& cb); nit: Could you fix the spacing alignment here? http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.cc File src/kudu/tools/table_scanner.cc: http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.cc@467 PS1, Line 467: const std::function<void(const KuduScanBatch& batch)>& cb) { nit: could you fix the spacing alignment here? http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.cc@476 PS1, Line 476: scan_status = token->IntoKuduScanner(&scanner_ptr); : if (!scan_status.ok()) return scan_status; nit: You can use the RETURN_NOT_OK macro and it does this for you. http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.cc@605 PS1, Line 605: vector<Status> thread_statuses; : for (i = 0; i < FLAGS_num_threads; ++i) { : thread_statuses.emplace_back(Status::OK()); : } nit: The default constructor for a Status object in Kudu is Status::OK(). If you supply an int to a vector's constructor, it'll initialize the members with the default constructor. E.g. vector<Status> thread_statuses(FLAGS_num_threads); http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.cc@624 PS1, Line 624: boost::bind(&TableScanner::CopyTask, this, thread_tokens[i], &thread_statuses[i]))); Since you're fixing this too, could you add a similar test for copying? Alternatively, don't do this here and you could perhaps address this in a second patch. http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.cc@631 PS1, Line 631: for (i = 0; i < FLAGS_num_threads; ++i) { There's a decent amount of stuff happening in this function. One thing of note is that we're timing how long it takes to scan. The end of this timing period is specified by sw.stop(), and then a timing message based on the scan is output below. Rather than exiting early, let's still print out the scanned rows and the timing info, but _additionally_ print out any errors and returning the first bad one. http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.cc@633 PS1, Line 633: LOG(INFO) << "Scanning failed " << thread_statuses[i].ToString() << endl; In tooling, we prefer to log to stdout. Here, that would be *out_. Also, instead of printing the first error, I think it'd be more usable to print out all of the errors, e.g. one on each line. WDYT? -- To view, visit http://gerrit.cloudera.org:8080/13733 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic45da537b8bacfa9625010536ea82da9a6e76100 Gerrit-Change-Number: 13733 Gerrit-PatchSet: 2 Gerrit-Owner: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Wed, 26 Jun 2019 00:59:00 +0000 Gerrit-HasComments: Yes
