Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19392 )

Change subject: [www] add slow scans show
......................................................................


Patch Set 11:

(25 comments)

Thank you for adding this new functionality.  It seems to be useful.

Just curious whether you considered an alternative approach: adding an extra 
'is_slow' field into the scan descriptor and updating it correspondingly, and 
in the UI of the embedded web server it's possible to sort by that field (more 
details in the comment for the Commit Message).

http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG
Commit Message:

PS11:
Could you add a test for the newly introduced ScannerManager::ListSlowScans()?  
tablet_server-test.cc or scan_token-test.cc look like good places to add a new 
test scenario.

Tablet servers have a special flag --scanner_inject_latency_on_each_batch_ms to 
inject latency into scanning process, so it's easy to make a scan 'slow' even 
if it scans just a single row.


http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG@7
PS11, Line 7: show
section


http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG@9
PS11, Line 9: historical completed scans
the history of completed scans


http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG@10
PS11, Line 10: convenient for us to get slow scans
easy to distinguish slow ones among those


http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG@12
PS11, Line 12: change the previous scan presentation into completed scan show 
and
             : add slow scans show
introduced an extra section to the scan page to show slow scans separately


http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG@13
PS11, Line 13: Slow scan defined only by the time used, which
             : threshold is --slow_scanner_threshold_ms
A scan is called 'slow' if it takes more time than defined by 
--slow_scanner_threshold_ms


http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG@15
PS11, Line 15: The number of slow scans show is limit by 
--slow_scan_history_count
The number of elements in the slow scans history is limited by 
--slow_scan_history_count


http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG@17
PS11, Line 17: Change-Id: Ibd1dcac8b81c5eefd306e7020c9c52b3f28e603c
BTW, did you consider an alternative approach: instead of two separate lists 
(with possibility of duplicating elements), just add an extra field into the 
ScanDescriptor: 'is_slow' that's assigned correspondingly.  Now, in the WebUI 
that would be an extra column that is possible to sort  the list by, so by 
sorting the list by that column the slow scans would be in the beginning or the 
end of the list depending on how many times the column header was clicked at.

What do you think?


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h
File src/kudu/tserver/scanners.h:

http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h@115
PS11, Line 115: scoped_refptr<ScanDescriptor>
Consider introducing a typedef for scoped_refptr<ScanDescriptor> (say, 
SharedScanDescriptor) and use it everywhere?


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h@117
PS11, Line 117: recently
recent


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h@118
PS11, Line 118: Slow scan defined only by the time used, which threshold is
              :   // --slow_scanner_threshold_ms.
A scan is 'slow' if it takes more than --slow_scanner_threshold_ms to complete.


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h@120
PS11, Line 120: The member number of return vector is limit by 
--slow_scan_history_count
The number of elements in the result vector is limited by 
--slow_scan_history_count.


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h@126
PS11, Line 126: whose scan exceeds time threshold
whose scan times exceed the threshold


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h@478
PS11, Line 478: ScanDescriptor is copyable
If I'm not mistaken, ScanDescriptor is copyable even without wrapping it into 
RefCounted.  It would be great to document the precise rationale behind 
switching to the wrapper for ScanDescriptor.  Probably, you meant that that's 
to avoid duplicating data for elements in slow_scans_ and completed_scans_?


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h@479
PS11, Line 479: : public RefCountedThreadSafe<ScanDescriptor> {
Just curious: why RefCountedThreadSafe seems to be more  preferable over 
std::shared_ptr in this case?


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc
File src/kudu/tserver/scanners.cc:

http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@68
PS11, Line 68: 60000
nit: for better readability, consider replacing with 60 * 1000L


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@68
PS11, Line 68: minutes
minute


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@69
PS11, Line 69: Number of milliseconds for the threshold of slow scan."
What do you think of introducing a special value for the flag (e.g., -1 or 0) 
that would mean no scan is considered slow?  Since we don't know what's the 
prevalent workload for a Kudu cluster, it could turn out that every scan is a 
full table scan with huge amount of data returned, so even if 60 seconds looks 
like a reasonable default value, TODO


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@75
PS11, Line 75: The slow "
             :              "scans define with --slow_scanner_threshold_ms
The threshold for a slow scan is defined with --slow_scanner_threshold_ms


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@153
PS11, Line 153:   {
              :     std::lock_guard<percpu_rwlock> l(completed_scans_lock_);
              :     completed_scans_.clear();
              :   }
              :   {
              :     std::lock_guard<percpu_rwlock> l(slow_scans_lock_);
              :     slow_scans_.clear();
              :   }
Why is it needed?  The containers would be automatically cleaned up on 
destruction by the C++ runtime.


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@291
PS11, Line 291: desc
Why not to continue using the move semantics here?


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@321
PS11, Line 321:   unordered_map<string, scoped_refptr<ScanDescriptor>> scans;
              :   {
              :     kudu::shared_lock<rw_spinlock> 
l(slow_scans_lock_.get_lock());
              :     for (const auto& scan : slow_scans_) {
              :       InsertOrUpdate(&scans, scan->scanner_id, scan);
              :     }
              :   }
What's the purpose of using unordered dictionary/map here if later on the 
result container is sorted anyways?  Why not to use result vector 'ret' right 
away?


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@355
PS11, Line 355: FLAGS_slow_scanner_threshold_ms
There are 3 accesses to this flag in the scope of this method.  Every access is 
an extra call and it involves some synchronization while fetching the value.  
Consider fetching the value of the flag just once, storing it in the local 
variable, and then re-using the local variable in the scope of this method.


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@361
PS11, Line 361:       LOG(INFO) << Substitute(
              :           "Slow scanner id: $0, of tablet $1, "
              :           "exceed the time threshold $2 ms for $3 ms.",
              :           it->first,
              :           scanner->tablet_id(),
              :           FLAGS_slow_scanner_threshold_ms,
              :           delta_time.ToMilliseconds());
Do you really want this to be logged for every scanner which goes beyond the 
defined 'slow' threshold?  My concern is that this might pollute the INFO log 
in case of heavy workload of many scans when the slow threshold isn't 
configured to take into account the regular patterns in the cluster.


http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@436
PS11, Line 436:   if (slow_scans_.capacity() == 0) {
              :     return;
              :   }
              :   if (slow_scans_.size() == slow_scans_.capacity()) {
              :     slow_scans_[slow_scans_offset_++] = descriptor;
              :     if (slow_scans_offset_ == slow_scans_.capacity()) {
              :       slow_scans_offset_ = 0;
              :     }
              :   } else {
              :     slow_scans_.emplace_back(descriptor);
              :   }
Instead of duplicating this code from RecordCompletedScanUnlocked(), consider 
extracting this circular buffer's logic into a function and use it both in 
ScannerManager::RecordCompletedScanUnlocked() and 
ScannerManager::RecordSlowScanUnlocked() method.



--
To view, visit http://gerrit.cloudera.org:8080/19392
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibd1dcac8b81c5eefd306e7020c9c52b3f28e603c
Gerrit-Change-Number: 19392
Gerrit-PatchSet: 11
Gerrit-Owner: KeDeng <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Sun, 22 Jan 2023 06:06:45 +0000
Gerrit-HasComments: Yes

Reply via email to