Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/12540 )
Change subject: Reduce startup log spam
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12540/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12540/1//COMMIT_MSG@66
PS1, Line 66: I0220 13:18:51.483634 230735872 tablet_bootstrap.cc:438] T 27705b0198da406d8301830cd942c7ee P 0a909baebce949a6aa4cdc0f196ecd00: Bootstrap replayed 1/1 log segments. Stats: ops{read=63 overwritten=0 applied=63 ignored=62} inserts{seen=0 ignored=0} mutations{seen=0 ignored=0} orphaned_commits=0. Pending: 0 replicates
> I'm hesitant about removing this one. I'm not a bootstrapping expert, but u

I'm very conflicted about this message and the timing messages. They are per tablet, so there's a lot of them on a busy server, and no one is going to go and look at them all. I think the main purpose of putting all of them in the logs is to make you feel like something is happening when you restart a large cluster and tail -f an INFO log.

Extremes and averages can be useful. What I'd like to say is that we should gather outcomes from bootstraps and expose that information somehow. One difficulty is that there isn't a "bootstrap phase" the way there's an LBM startup phase, a tmeta load phase, etc., so there's no moment in time when it's appropriate to stop and say "here's a summary of all the bootstrappin'". There may always be one or more tablets bootstrapping on a server, because of table creation and tablet copies.

>From the user's point of view, what would information about bootstrapping be
>good for? Most of all, I think users want to see when the tserver is roughly
>"available". The /tablets page shows this by tracking the percentage of live
>replicas in each state. If they see other statistics, there's not really
>anything to do about them.

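To make the "gather outcomes from bootstraps" idea a bit more concrete, here is a minimal sketch of a running summary that can be read at any moment, which sidesteps the lack of a distinct bootstrap phase. It is only an illustration: BootstrapSummary, record(), and the field names are hypothetical and not part of Kudu, and the durations in the usage lines are made-up values (only the tablet ID and applied=63 come from the log line quoted above).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootstrapSummary:
    """Running summary of per-tablet bootstrap outcomes (hypothetical helper)."""

    count: int = 0
    total_ops_applied: int = 0
    total_seconds: float = 0.0
    max_seconds: float = 0.0
    slowest_tablet: Optional[str] = None

    def record(self, tablet_id: str, seconds: float, ops_applied: int) -> None:
        """Fold one finished bootstrap into the summary."""
        self.count += 1
        self.total_ops_applied += ops_applied
        self.total_seconds += seconds
        # Track the extreme so an outlier tablet can be named later.
        if self.slowest_tablet is None or seconds > self.max_seconds:
            self.max_seconds = seconds
            self.slowest_tablet = tablet_id

    @property
    def mean_seconds(self) -> float:
        """Average bootstrap time so far; 0.0 before any bootstrap finishes."""
        return self.total_seconds / self.count if self.count else 0.0


# Usage: tablets can keep arriving (table creation, tablet copies), and the
# summary is valid whenever someone asks for it.
summary = BootstrapSummary()
summary.record("27705b0198da406d8301830cd942c7ee", seconds=0.5, ops_applied=63)
summary.record("some-other-tablet", seconds=2.0, ops_applied=10)
print(summary.count, summary.mean_seconds, summary.slowest_tablet)
```

Something like this could replace the per-tablet INFO lines with one periodically refreshed summary, though where it would be exposed (a metric, a web UI page, a single log line) is left open here.
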
An expert ought to use tracing.html and other advanced tools to see inside a bootstrap, so I think I should double check that tracing of bootstrap is really good.

The missing pieces are how to identify that bootstrap in general is really slow and how to identify individual tablets that may be slow to bootstrap. /tablets is pretty ok for the first task. For the second, maybe we ought to expose more metrics or info about outliers? It's not clear how to do that well. One simple thing would be to publish metrics from the open tablet thread pool. This would capture queue times plus run times, where the run time measures bootstrap + start.

A final important point here is that startup problems are hard to reproduce: they require restarting. Slow scans or writes can usually be reproduced whenever, and so can be examined with /stacks, tracing, metrics, etc. much more easily. So maybe excessive logging during startup is justified so that there's evidence to bring to experts.

Anyway, these were just some off the cuff thoughts. Let me know what you think.


--
To view, visit http://gerrit.cloudera.org:8080/12540
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3793a2385612cf920a94e5f62a559c350b8bf461
Gerrit-Change-Number: 12540
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Thu, 21 Feb 2019 01:45:30 +0000
Gerrit-HasComments: Yes
