Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17903 )
Change subject: KUDU-1959 - Implement server startup progress page for tablet and master servers ...................................................................... Patch Set 2: (8 comments) http://gerrit.cloudera.org:8080/#/c/17903/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17903/1//COMMIT_MSG@11 PS1, Line 11: * In case of log block manager, we expose the below metrics: : - log_block_manager_total_containers_startup : total containers present, : - log_block_manager_processed_containers_startup : count of containers : opened/processed until the requested instant of time and : - log_block_manager_containers_processing_time_startup : time elapsed : for opening the containers. If all the containers are not yet opened, : we provide the time elapsed so far. : * In case of tablet server, we expose the below metrics: : - tablets_num_total_startup : total tablets present, : - tablets_num_opened_startup : count of tablets opening/processed until : the requested instant of time and : - tablets_opening_time_startup : time elapsed for opening the tablets. : If all the tablets are not yet opened, we provide the time elapsed so : far. Please split this patch into two separate patches, one for each of these. http://gerrit.cloudera.org:8080/#/c/17903/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17903/2//COMMIT_MSG@7 PS2, Line 7: Implement server startup progress page for tablet and master servers This doesn't describe this patch, does it? Doesn't this patch add metrics, not a progress page? http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/fs/fs_manager.cc File src/kudu/fs/fs_manager.cc: http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/fs/fs_manager.cc@112 PS2, Line 112: METRIC_DEFINE_gauge_int64(server, log_block_manager_containers_processing_time_startup, : "Time taken to open all log block containers during server startup", : kudu::MetricUnit::kMilliseconds, : "The total time taken by the server to open all the container" : "files during the startup", : kudu::MetricLevel::kInfo); See my other thoughts on elapsed time metrics. http://gerrit.cloudera.org:8080/#/c/17903/1/src/kudu/fs/log_block_manager.cc File src/kudu/fs/log_block_manager.cc: http://gerrit.cloudera.org:8080/#/c/17903/1/src/kudu/fs/log_block_manager.cc@2561 PS1, Line 2561: ->set_value(containers_total->load()) Rather than loading from the atomic (which isn't free), how about just incrementing by containers_seen.size()? http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/server/startup_path_handler.cc File src/kudu/server/startup_path_handler.cc: http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/server/startup_path_handler.cc@147 PS2, Line 147: Seem's we're missing the tablets processed counter. Though see my below comment on this aggregated metric -- I'm not sure we get much from exposing this aggregation anyway http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/server/startup_path_handler.cc@142 PS2, Line 142: int32 StartupPathHandler::StartupProgressPercentageMetric() { : int percent = 0; : percent += (init_progress_.IsStopped()) ? 100 : 0; : percent += (read_filesystem_progress_.IsStopped()) ? 100 : 0; : percent += (start_rpc_server_progress_.IsStopped()) ? 100 : 0; : if (is_tablet_server_) { : if (containers_total_ == 0 && read_data_directories_progress_.IsStopped()) { : percent += 100; : } else if (containers_total_ != 0) { : percent += containers_processed_ * 100 / containers_total_; : } : } : else { : percent += (initialize_master_catalog_progress_.IsStopped()) ? 100 : 0; : } : return percent/4; : } I'm not convinced that this adds much value, over the other metrics added in this patch. It's a pretty weird concept to reason about. E.g. if the first 50% took 10ms, that doesn't really tell me much about the remaining 50%. http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/server/startup_path_handler.cc@159 PS2, Line 159: : int64_t StartupPathHandler::StartupProgressTimeElapsedMetric() { : int64_t time_elapsed = 0; : time_elapsed += init_progress_.TimeElapsed().ToMilliseconds(); : time_elapsed += read_filesystem_progress_.TimeElapsed().ToMilliseconds(); : if (is_tablet_server_) { : time_elapsed += start_tablets_progress_.TimeElapsed().ToMilliseconds(); : } : else { : time_elapsed += initialize_master_catalog_progress_.TimeElapsed().ToMilliseconds(); : } : time_elapsed += start_rpc_server_progress_.TimeElapsed().ToMilliseconds(); : return time_elapsed; : } : I'm similarly unconvinced that this is a useful metric. How do you expect this to be used, vs just exposing the tablets/containers processed metrics? http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/tserver/ts_tablet_manager.cc@1381 PS2, Line 1381: METRIC_tablets_opening_time_startup.Instantiate(server_->metric_entity(), : (start_tablets->TimeElapsed()).ToMilliseconds()); It seems a bit odd to have a "time" metric that only gets updated based on calls to this function, instead of having some direct reading of the timer via a function gauge. E.g. successive calls to this metric won't have different values unless UpdateStartupProgress() has been called in between That said, I wonder if the time elapsed metric is actually useful for anything, over what's in the web UI. Maybe we should scrap it and just keep the tablets_num_opened_startup and tablets_num_total_startup metrics, since they're more straightforward to reason about. -- To view, visit http://gerrit.cloudera.org:8080/17903 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0a508c3baf0a0d77baf75f36f7bb305a6ad821e1 Gerrit-Change-Number: 17903 Gerrit-PatchSet: 2 Gerrit-Owner: Abhishek Chennaka <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Mon, 11 Oct 2021 19:44:40 +0000 Gerrit-HasComments: Yes
