Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/19392 )
Change subject: [www] add slow scans show ...................................................................... Patch Set 11: (25 comments) Thank you for adding this new functionality. It seems to be useful. Just curious whether you considered an alternative approach: adding an extra 'is_slow' field into the scan descriptor and updating it correspondingly, and in the UI of the embedded web server it's possible to sort by that field (more details in the comment for the Commit Message). http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG Commit Message: PS11: Could you add a test for the newly introduced ScannerManager::ListSlowScans()? tablet_server-test.cc or scan_token-test.cc look like good places to add a new test scenario. Tablet servers have a special flag --scanner_inject_latency_on_each_batch_ms to inject latency into scanning process, so it's easy to make a scan 'slow' even if it scans just a single row. http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG@7 PS11, Line 7: show section http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG@9 PS11, Line 9: historical completed scans the history of completed scans http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG@10 PS11, Line 10: convenient for us to get slow scans easy to distinguish slow ones among those http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG@12 PS11, Line 12: change the previous scan presentation into completed scan show and : add slow scans show introduced an extra section to the scan page to show slow scans separately http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG@13 PS11, Line 13: Slow scan defined only by the time used, which : threshold is --slow_scanner_threshold_ms A scan is called 'slow' if it takes more time than defined by --slow_scanner_threshold_ms http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG@15 PS11, Line 15: The number of slow scans show is limit by --slow_scan_history_count The number of elements in the slow scans history is limited by --slow_scan_history_count http://gerrit.cloudera.org:8080/#/c/19392/11//COMMIT_MSG@17 PS11, Line 17: Change-Id: Ibd1dcac8b81c5eefd306e7020c9c52b3f28e603c BTW, did you consider an alternative approach: instead of two separate lists (with possibility of duplicating elements), just add an extra field into the ScanDescriptor: 'is_slow' that's assigned correspondingly. Now, in the WebUI that would be an extra column that is possible to sort the list by, so by sorting the list by that column the slow scans would be in the beginning or the end of the list depending on how many times the column header was clicked at. What do you think? http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h File src/kudu/tserver/scanners.h: http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h@115 PS11, Line 115: scoped_refptr<ScanDescriptor> Consider introducing a typedef for scoped_refptr<ScanDescriptor> (say, SharedScanDescriptor) and use it everywhere? http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h@117 PS11, Line 117: recently recent http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h@118 PS11, Line 118: Slow scan defined only by the time used, which threshold is : // --slow_scanner_threshold_ms. A scan is 'slow' if it takes more than --slow_scanner_threshold_ms to complete. http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h@120 PS11, Line 120: The member number of return vector is limit by --slow_scan_history_count The number of elements in the result vector is limited by --slow_scan_history_count. http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h@126 PS11, Line 126: whose scan exceeds time threshold whose scan times exceed the threshold http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h@478 PS11, Line 478: ScanDescriptor is copyable If I'm not mistaken, ScanDescriptor is copyable even without wrapping it into RefCounted. It would be great to document the precise rationale behind switching to the wrapper for ScanDescriptor. Probably, you meant that that's to avoid duplicating data for elements in slow_scans_ and completed_scans_? http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.h@479 PS11, Line 479: : public RefCountedThreadSafe<ScanDescriptor> { Just curious: why RefCountedThreadSafe seems to be more preferable over std::shared_ptr in this case? http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc File src/kudu/tserver/scanners.cc: http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@68 PS11, Line 68: 60000 nit: for better readability, consider replacing with 60 * 1000L http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@68 PS11, Line 68: minutes minute http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@69 PS11, Line 69: Number of milliseconds for the threshold of slow scan." What do you think of introducing a special value for the flag (e.g., -1 or 0) that would mean no scan is considered slow? Since we don't know what's the prevalent workload for a Kudu cluster, it could turn out that every scan is a full table scan with huge amount of data returned, so even if 60 seconds looks like a reasonable default value, TODO http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@75 PS11, Line 75: The slow " : "scans define with --slow_scanner_threshold_ms The threshold for a slow scan is defined with --slow_scanner_threshold_ms http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@153 PS11, Line 153: { : std::lock_guard<percpu_rwlock> l(completed_scans_lock_); : completed_scans_.clear(); : } : { : std::lock_guard<percpu_rwlock> l(slow_scans_lock_); : slow_scans_.clear(); : } Why is it needed? The containers would be automatically cleaned up on destruction by the C++ runtime. http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@291 PS11, Line 291: desc Why not to continue using the move semantics here? http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@321 PS11, Line 321: unordered_map<string, scoped_refptr<ScanDescriptor>> scans; : { : kudu::shared_lock<rw_spinlock> l(slow_scans_lock_.get_lock()); : for (const auto& scan : slow_scans_) { : InsertOrUpdate(&scans, scan->scanner_id, scan); : } : } What's the purpose of using unordered dictionary/map here if later on the result container is sorted anyways? Why not to use result vector 'ret' right away? http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@355 PS11, Line 355: FLAGS_slow_scanner_threshold_ms There are 3 accesses to this flag in the scope of this method. Every access is an extra call and it involves some synchronization while fetching the value. Consider fetching the value of the flag just once, storing it in the local variable, and then re-using the local variable in the scope of this method. http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@361 PS11, Line 361: LOG(INFO) << Substitute( : "Slow scanner id: $0, of tablet $1, " : "exceed the time threshold $2 ms for $3 ms.", : it->first, : scanner->tablet_id(), : FLAGS_slow_scanner_threshold_ms, : delta_time.ToMilliseconds()); Do you really want this to be logged for every scanner which goes beyond the defined 'slow' threshold? My concern is that this might pollute the INFO log in case of heavy workload of many scans when the slow threshold isn't configured to take into account the regular patterns in the cluster. http://gerrit.cloudera.org:8080/#/c/19392/11/src/kudu/tserver/scanners.cc@436 PS11, Line 436: if (slow_scans_.capacity() == 0) { : return; : } : if (slow_scans_.size() == slow_scans_.capacity()) { : slow_scans_[slow_scans_offset_++] = descriptor; : if (slow_scans_offset_ == slow_scans_.capacity()) { : slow_scans_offset_ = 0; : } : } else { : slow_scans_.emplace_back(descriptor); : } Instead of duplicating this code from RecordCompletedScanUnlocked(), consider extracting this circular buffer's logic into a function and use it both in ScannerManager::RecordCompletedScanUnlocked() and ScannerManager::RecordSlowScanUnlocked() method. -- To view, visit http://gerrit.cloudera.org:8080/19392 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibd1dcac8b81c5eefd306e7020c9c52b3f28e603c Gerrit-Change-Number: 19392 Gerrit-PatchSet: 11 Gerrit-Owner: KeDeng <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Sun, 22 Jan 2023 06:06:45 +0000 Gerrit-HasComments: Yes
