Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17903 )

Change subject: KUDU-1959 - Implement server startup progress page for tablet 
and master servers
......................................................................


Patch Set 2:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/17903/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17903/1//COMMIT_MSG@11
PS1, Line 11: * In case of log block manager, we expose the below metrics:
            :   - log_block_manager_total_containers_startup : total containers 
present,
            :   - log_block_manager_processed_containers_startup : count of 
containers
            :     opened/processed until the requested instant of time and
            :   - log_block_manager_containers_processing_time_startup : time 
elapsed
            :     for opening the containers. If all the containers are not yet 
opened,
            :     we provide the time elapsed so far.
            : * In case of tablet server, we expose the below metrics:
            :   - tablets_num_total_startup : total tablets present,
            :   - tablets_num_opened_startup : count of tablets 
opening/processed until
            :     the requested instant of time and
            :   - tablets_opening_time_startup : time elapsed for opening the 
tablets.
            :     If all the tablets are not yet opened, we provide the time 
elapsed so
            :     far.
Please split this patch into two separate patches, one for each of these.


http://gerrit.cloudera.org:8080/#/c/17903/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17903/2//COMMIT_MSG@7
PS2, Line 7: Implement server startup progress page for tablet and master 
servers
This doesn't describe this patch, does it? Doesn't this patch add metrics, not 
a progress page?


http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/fs/fs_manager.cc
File src/kudu/fs/fs_manager.cc:

http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/fs/fs_manager.cc@112
PS2, Line 112: METRIC_DEFINE_gauge_int64(server, 
log_block_manager_containers_processing_time_startup,
             :                           "Time taken to open all log block 
containers during server startup",
             :                           kudu::MetricUnit::kMilliseconds,
             :                           "The total time taken by the server to 
open all the container"
             :                           "files during the startup",
             :                           kudu::MetricLevel::kInfo);
See my other thoughts on elapsed time metrics.


http://gerrit.cloudera.org:8080/#/c/17903/1/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

http://gerrit.cloudera.org:8080/#/c/17903/1/src/kudu/fs/log_block_manager.cc@2561
PS1, Line 2561: ->set_value(containers_total->load())
Rather than loading from the atomic (which isn't free), how about just 
incrementing by containers_seen.size()?


http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/server/startup_path_handler.cc
File src/kudu/server/startup_path_handler.cc:

http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/server/startup_path_handler.cc@147
PS2, Line 147:
Seem's we're missing the tablets processed counter. Though see my below comment 
on this aggregated metric -- I'm not sure we get much from exposing this 
aggregation anyway


http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/server/startup_path_handler.cc@142
PS2, Line 142: int32 StartupPathHandler::StartupProgressPercentageMetric() {
             :   int percent = 0;
             :   percent += (init_progress_.IsStopped()) ? 100 : 0;
             :   percent += (read_filesystem_progress_.IsStopped()) ? 100 : 0;
             :   percent += (start_rpc_server_progress_.IsStopped()) ? 100 : 0;
             :   if (is_tablet_server_) {
             :     if (containers_total_ == 0 && 
read_data_directories_progress_.IsStopped()) {
             :       percent += 100;
             :     } else if (containers_total_ != 0) {
             :       percent += containers_processed_ * 100 / containers_total_;
             :     }
             :   }
             :   else {
             :     percent += (initialize_master_catalog_progress_.IsStopped()) 
? 100 : 0;
             :   }
             :   return percent/4;
             : }
I'm not convinced that this adds much value, over the other metrics added in 
this patch. It's a pretty weird concept to reason about. E.g. if the first 50% 
took 10ms, that doesn't really tell me much about the remaining 50%.


http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/server/startup_path_handler.cc@159
PS2, Line 159:
             : int64_t StartupPathHandler::StartupProgressTimeElapsedMetric() {
             :   int64_t time_elapsed = 0;
             :   time_elapsed += init_progress_.TimeElapsed().ToMilliseconds();
             :   time_elapsed += 
read_filesystem_progress_.TimeElapsed().ToMilliseconds();
             :   if (is_tablet_server_) {
             :     time_elapsed += 
start_tablets_progress_.TimeElapsed().ToMilliseconds();
             :   }
             :   else {
             :     time_elapsed += 
initialize_master_catalog_progress_.TimeElapsed().ToMilliseconds();
             :   }
             :   time_elapsed += 
start_rpc_server_progress_.TimeElapsed().ToMilliseconds();
             :   return time_elapsed;
             : }
             :
I'm similarly unconvinced that this is a useful metric. How do you expect this 
to be used, vs just exposing the tablets/containers processed metrics?


http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/17903/2/src/kudu/tserver/ts_tablet_manager.cc@1381
PS2, Line 1381:       
METRIC_tablets_opening_time_startup.Instantiate(server_->metric_entity(),
              :           (start_tablets->TimeElapsed()).ToMilliseconds());
It seems a bit odd to have a "time" metric that only gets updated based on 
calls to this function, instead of having some direct reading of the timer via 
a function gauge. E.g. successive calls to this metric won't have different 
values unless UpdateStartupProgress() has been called in between

That said, I wonder if the time elapsed metric is actually useful for anything, 
over what's in the web UI. Maybe we should scrap it and just keep the 
tablets_num_opened_startup and tablets_num_total_startup metrics, since they're 
more straightforward to reason about.



--
To view, visit http://gerrit.cloudera.org:8080/17903
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0a508c3baf0a0d77baf75f36f7bb305a6ad821e1
Gerrit-Change-Number: 17903
Gerrit-PatchSet: 2
Gerrit-Owner: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 11 Oct 2021 19:44:40 +0000
Gerrit-HasComments: Yes

Reply via email to