Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13733 )

Change subject: Modify table scan tool to collect and surface errors instead of 
crashing
......................................................................


Patch Set 2:

(11 comments)

Looks pretty good!

http://gerrit.cloudera.org:8080/#/c/13733/1//COMMIT_MSG
Commit Message:

PS1:
nit: For commit messages, we try to follow the guidance outlined here:

https://chris.beams.io/posts/git-commit/


http://gerrit.cloudera.org:8080/#/c/13733/1//COMMIT_MSG@7
PS1, Line 7: Modify table scan tool to collect and surface errors instead of 
crashing
nit: Could you prefix this with KUDU-2851 so it's easy to associate this patch 
with the ticket? e.g.

KUDU-2851: modify table scan tool to collect and surface errors instead of 
crashing


http://gerrit.cloudera.org:8080/#/c/13733/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13733/2//COMMIT_MSG@9
PS2, Line 9: Initialized a Status for each thread in StartWork().
           : Replaced CHECKs in ScanData() with return of bad Status.
           : ScanTask() or CopyTask() catch bad Statuses from ScanData() in
           : out-parameter back to StartWork().
           : StartWork() logs and returns the first bad Status from the thread 
pool.
nit: Since this is a list of stuff, could you maybe add bullets (e.g. *s or -s) 
before each item, and adjust the spacing so it's a little easier to read? E.g.

...
- ScanTask() or Copy...
  out-parameter back...
- StartWork() logs a...

Also, these are good details, but you probably don't _need_ to go _that_ 
in-depth to the point where you're calling out specific functions. A high-level 
description of the approach should suffice.


http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/kudu-tool-test.cc@5004
PS1, Line 5004:   // Now shut down the tablet servers so the scans cannot 
proceed.
              :   // Upon running the scan tool, we should get a TimedOut 
status.
              :   
NO_FATALS(cluster_->ShutdownNodes(cluster::ClusterNodes::TS_ONLY));
The goal here was to make sure that our scan hit an error, and shutting down 
the Tablet Servers was a means to that end. An side effect of that decision, as 
you saw, is that we have to wait for our scan to time out, which can take a 
while. Instead, how about we use a different way to inject errors to the scan?

Let's instead set the following flags on the Tablet Servers to inject errors:
- tserver_enforce_access_control=true
- tserver_inject_invalid_authz_token_ratio=1.0

Rather than having to wait for a timeout, this will have the scans make their 
ways to the Tablet Servers, and immediately be rejected because they're not 
authorized.

An example of setting flags on a cluster can be found in 
integration-tests/disk_failure-itest.cc at L203.


http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.h
File src/kudu/tools/table_scanner.h:

http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.h@84
PS1, Line 84:                 const std::function<void(const 
kudu::client::KuduScanBatch& batch)>& cb);
nit: Could you fix the spacing alignment here?


http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.cc
File src/kudu/tools/table_scanner.cc:

http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.cc@467
PS1, Line 467:                             const std::function<void(const 
KuduScanBatch& batch)>& cb) {
nit: could you fix the spacing alignment here?


http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.cc@476
PS1, Line 476:     scan_status = token->IntoKuduScanner(&scanner_ptr);
             :     if (!scan_status.ok()) return scan_status;
nit: You can use the RETURN_NOT_OK macro and it does this for you.


http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.cc@605
PS1, Line 605:   vector<Status> thread_statuses;
             :   for (i = 0; i < FLAGS_num_threads; ++i) {
             :     thread_statuses.emplace_back(Status::OK());
             :   }
nit: The default constructor for a Status object in Kudu is Status::OK(). If 
you supply an int to a vector's constructor, it'll initialize the members with 
the default constructor. E.g.

  vector<Status> thread_statuses(FLAGS_num_threads);


http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.cc@624
PS1, Line 624:         boost::bind(&TableScanner::CopyTask, this, 
thread_tokens[i], &thread_statuses[i])));
Since you're fixing this too, could you add a similar test for copying? 
Alternatively, don't do this here and you could perhaps address this in a 
second patch.


http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.cc@631
PS1, Line 631:   for (i = 0; i < FLAGS_num_threads; ++i) {
There's a decent amount of stuff happening in this function. One thing of note 
is that we're timing how long it takes to scan. The end of this timing period 
is specified by sw.stop(), and then a timing message based on the scan is 
output below.

Rather than exiting early, let's still print out the scanned rows and the 
timing info, but _additionally_ print out any errors and returning the first 
bad one.


http://gerrit.cloudera.org:8080/#/c/13733/1/src/kudu/tools/table_scanner.cc@633
PS1, Line 633:       LOG(INFO) << "Scanning failed " << 
thread_statuses[i].ToString() << endl;
In tooling, we prefer to log to stdout. Here, that would be *out_.

Also, instead of printing the first error, I think it'd be more usable to print 
out all of the errors, e.g. one on each line. WDYT?



--
To view, visit http://gerrit.cloudera.org:8080/13733
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic45da537b8bacfa9625010536ea82da9a6e76100
Gerrit-Change-Number: 13733
Gerrit-PatchSet: 2
Gerrit-Owner: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 26 Jun 2019 00:59:00 +0000
Gerrit-HasComments: Yes

Reply via email to