[
https://issues.apache.org/jira/browse/KUDU-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351135#comment-17351135
]
Abhishek commented on KUDU-1959:
--------------------------------
As a first step towards this issue, we could add a tablet server startup page
that shows startup progress. We could break tablet server startup down into a
few phases (something like initializing, reading the metadata directory,
reading the data directories, bootstrapping, connecting to the masters). The
most time-consuming phases are usually reading the log block containers
(reading the data directories) and bootstrapping the tablets. For these two
phases we can show the total number of LBM containers/tablets present and the
number processed so far, to keep track of startup progress.
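To make the idea concrete, here is a rough sketch of the state such a page could render. It is written in Python only for brevity; the names `Phase`, `Counter` and `StartupProgress` are hypothetical and do not correspond to existing Kudu classes.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    # Phase names follow the breakdown proposed above; they are illustrative.
    INITIALIZING = 1
    READING_METADATA_DIR = 2
    READING_DATA_DIRS = 3
    BOOTSTRAPPING_TABLETS = 4
    CONNECTING_TO_MASTERS = 5


@dataclass
class Counter:
    """Processed/total pair for one long-running phase."""
    total: int = 0
    done: int = 0

    def percent(self) -> float:
        # An unknown or empty total is reported as 0% instead of dividing by zero.
        return 100.0 * self.done / self.total if self.total else 0.0


@dataclass
class StartupProgress:
    """Hypothetical snapshot of what the startup page would display."""
    phase: Phase = Phase.INITIALIZING
    containers: Counter = field(default_factory=Counter)
    tablets: Counter = field(default_factory=Counter)

    def render(self) -> str:
        lines = [f"phase: {self.phase.name}"]
        for label, c in (("LBM containers", self.containers),
                         ("tablets", self.tablets)):
            lines.append(f"{label}: {c.done}/{c.total} ({c.percent():.0f}%)")
        return "\n".join(lines)
```

Only the two slow phases carry a counter; the other phases are short enough that showing the current phase name should be sufficient.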
Now for the question of how we get the total number of LBM containers: since
we do not have a metric for that yet (and even if we had one, it would be
reset when the server restarts), we could just count the data files in the
configured data directories.
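A minimal sketch of that count, in Python for brevity. It assumes each container has exactly one data file with a `.data` suffix somewhere under a data directory; the function name and the suffix default are assumptions for illustration, not existing Kudu code.

```python
from pathlib import Path
from typing import Iterable


def count_container_data_files(data_dirs: Iterable[str],
                               suffix: str = ".data") -> int:
    """Count files ending in `suffix` under each data directory.

    Assumption: one such file per LBM container. Directories that do not
    exist are skipped, so a bad path lowers the total instead of failing.
    """
    total = 0
    for d in data_dirs:
        root = Path(d)
        if not root.is_dir():
            continue
        # rglob walks subdirectories too, in case the files are nested.
        total += sum(1 for p in root.rglob(f"*{suffix}") if p.is_file())
    return total
```

The result is only an upfront estimate of the total; the count of processed containers would still come from the block manager as it opens them.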
The total number of tablets is available once the metadata directory has been
scanned. Currently we start the tablet server WebUI while the tablets are in
the bootstrapping phase. We could start the WebUI before this phase, serving
only the tablet server startup progress page at first, and load the other
pages once we reach the bootstrapping phase.
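The staged WebUI startup described above could look roughly like the following. This is a Python sketch with no real HTTP server behind it; `StagedWebUI`, the `/startup` path and the 503 response for pages that are not yet loaded are all assumptions for illustration.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[], str]


class StagedWebUI:
    """Serves 'early' pages immediately and holds the rest back until enabled."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self._deferred: Dict[str, Handler] = {}
        self._all_enabled = False

    def register(self, path: str, handler: Handler, early: bool = False) -> None:
        # Early pages (e.g. the startup progress page) are live right away.
        if early or self._all_enabled:
            self._handlers[path] = handler
        else:
            self._deferred[path] = handler

    def enable_all_pages(self) -> None:
        # Called once startup reaches the bootstrapping phase.
        self._handlers.update(self._deferred)
        self._deferred.clear()
        self._all_enabled = True

    def handle(self, path: str) -> Tuple[int, str]:
        handler = self._handlers.get(path)
        if handler is not None:
            return 200, handler()
        if path in self._deferred:
            return 503, "server is still starting up"
        return 404, "not found"
```

Usage would be along these lines: register the progress page with `early=True`, register everything else normally, and call `enable_all_pages()` when bootstrapping begins.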
> Hard to tell when a cluster is done starting up
> -----------------------------------------------
>
> Key: KUDU-1959
> URL: https://issues.apache.org/jira/browse/KUDU-1959
> Project: Kudu
> Issue Type: Improvement
> Components: ops-tooling
> Reporter: Jean-Daniel Cryans
> Assignee: Abhishek
> Priority: Major
> Labels: roadmap-candidate, usability
>
> When restarting a cluster that has a good amount of data, it's hard to tell
> when it's "done". Right now, these are the things I do:
> - Run ksck, wait until most tablets are not in "unavailable" or
> "bootstrapping" state.
> - Watch the metrics and see when the data under management is close to where
> it was before restarting (it grows as tablets are getting bootstrapped).
> - Look at the tablet server web UIs for tablets, compare how many are done
> bootstrapping VS in the process of VS not started.
> Ideas on how to improve this:
> - In the master's web UI for tablet servers, show how many tablets are
> running VS not running (I wouldn't add anything about tombstoned tablets)
> - Add metrics for tablets in different states.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)