[kudu-CR] Reduce startup log spam

Will Berkeley (Code Review) Wed, 20 Feb 2019 17:47:24 -0800

Will Berkeley has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12540 )


Change subject: Reduce startup log spam
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12540/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12540/1//COMMIT_MSG@66
PS1, Line 66: I0220 13:18:51.483634 230735872 tablet_bootstrap.cc:438] T 
27705b0198da406d8301830cd942c7ee P 0a909baebce949a6aa4cdc0f196ecd00: Bootstrap 
replayed 1/1 log segments. Stats: ops{read=63 overwritten=0 applied=63 
ignored=62} inserts{seen=0 ignored=0} mutations{seen=0 ignored=0} 
orphaned_commits=0. Pending: 0 replicates
> I'm hesitant about removing this one. I'm not a bootstrapping expert, but u
I'm very conflicted about this message and the timing messages. They are 
per-tablet, so there are a lot of them on a busy server, and no one is going to 
go and look at them all. I think the main purpose of putting all of them in the 
logs is to make you feel like something is happening when you restart a large 
cluster and tail -f an INFO log. Extremes and averages can be useful, though. 
What I'd like to say is that we should gather outcomes from bootstraps and 
expose that information somehow. One difficulty is that there isn't a 
"bootstrap phase" the way there's an LBM startup phase, a tmeta load phase, 
etc., so there's no moment in time when it's appropriate to stop and say 
"here's a summary of all the bootstrappin'". There may always be one or more 
tablets bootstrapping on a server, because of table creation and tablet copies.
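
To make "extremes and averages" concrete, here is a rough sketch of pulling 
them out of an existing INFO log after the fact. It is not part of this patch. 
The regex is modeled only on the one log line quoted above, so other Kudu 
versions may need a different pattern. The function names and the second 
sample line are made up for illustration.

```python
import re
from statistics import mean

# Pattern modeled on the single bootstrap line quoted in this review.
# Assumption: tablet and peer IDs are 32 hex characters, as in that line.
BOOTSTRAP_RE = re.compile(
    r"T (?P<tablet>[0-9a-f]{32}) P (?P<peer>[0-9a-f]{32}): "
    r"Bootstrap replayed (?P<done>\d+)/(?P<total>\d+) log segments\. "
    r"Stats: ops\{read=(?P<read>\d+) overwritten=(?P<overwritten>\d+) "
    r"applied=(?P<applied>\d+) ignored=(?P<ignored>\d+)\}"
)

SAMPLE_LINES = [
    # The line from the commit message (unwrapped).
    "I0220 13:18:51.483634 230735872 tablet_bootstrap.cc:438] "
    "T 27705b0198da406d8301830cd942c7ee P 0a909baebce949a6aa4cdc0f196ecd00: "
    "Bootstrap replayed 1/1 log segments. Stats: ops{read=63 overwritten=0 "
    "applied=63 ignored=62} inserts{seen=0 ignored=0} "
    "mutations{seen=0 ignored=0} orphaned_commits=0. Pending: 0 replicates",
    # Hypothetical second tablet; every value here is invented.
    "I0220 13:18:52.000000 230735872 tablet_bootstrap.cc:438] "
    "T ffffffffffffffffffffffffffffffff P 0a909baebce949a6aa4cdc0f196ecd00: "
    "Bootstrap replayed 2/2 log segments. Stats: ops{read=200 overwritten=0 "
    "applied=150 ignored=50} inserts{seen=0 ignored=0} "
    "mutations{seen=0 ignored=0} orphaned_commits=0. Pending: 0 replicates",
]


def parse_bootstrap_lines(lines):
    """Return one dict per matching line; non-matching lines are skipped."""
    records = []
    for line in lines:
        m = BOOTSTRAP_RE.search(line)
        if m is None:
            continue
        rec = {k: int(v) for k, v in m.groupdict().items()
               if k not in ("tablet", "peer")}
        rec["tablet"] = m.group("tablet")
        records.append(rec)
    return records


def summarize(records, field):
    """Min, max, and mean of one numeric field, plus the tablet at the max."""
    if not records:
        return None
    values = [r[field] for r in records]
    worst = max(records, key=lambda r: r[field])
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "max_tablet": worst["tablet"],
    }


if __name__ == "__main__":
    print(summarize(parse_bootstrap_lines(SAMPLE_LINES), "read"))
```

A summary like this only works if the per-tablet lines are still logged, which 
is one argument for keeping them.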

From the user's point of view, what would information about bootstrapping be 
good for? Most of all, I think users want to see when the tserver is roughly 
"available". The /tablets page shows this by tracking the percentage of live 
replicas in each state. If they see other statistics, there's not really 
anything to do about them.

An expert ought to use tracing.html and other advanced tools to see inside a 
bootstrap, so I think I should double check that tracing of bootstrap is really 
good. The missing pieces are how to identify that bootstrap in general is 
really slow and how to identify individual tablets that may be slow to 
bootstrap. /tablets is pretty ok for the first task. For the second, maybe we 
ought to expose more metrics or info about outliers? It's not clear how to do 
that well. One simple thing would be to publish metrics from the open tablet 
thread pool. This would capture queue times plus run times, where the run time 
measures bootstrap + start.
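
As a rough illustration of the thread pool idea, here is a sketch of recording 
queue time and run time per task. It uses Python's standard thread pool rather 
than Kudu's C++ ThreadPool, and TimedPool, slowest_by_run, and the task names 
are all hypothetical. It only shows the shape of the measurement.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class TimedPool:
    """Wraps a thread pool and records (name, queue_secs, run_secs) per task."""

    def __init__(self, max_workers):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self.samples = []

    def submit(self, name, fn, *args):
        enqueued = time.monotonic()

        def task():
            started = time.monotonic()
            try:
                return fn(*args)
            finally:
                finished = time.monotonic()
                with self._lock:
                    # Queue time: waiting for a worker. Run time: the task
                    # itself (in the tserver this would be bootstrap + start).
                    self.samples.append(
                        (name, started - enqueued, finished - started))

        return self._pool.submit(task)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def slowest_by_run(samples, n=1):
    """Return the n samples with the longest run time (the outliers)."""
    return sorted(samples, key=lambda s: s[2], reverse=True)[:n]


if __name__ == "__main__":
    pool = TimedPool(max_workers=1)
    pool.submit("tablet-a", time.sleep, 0.05)  # stands in for a slow bootstrap
    pool.submit("tablet-b", time.sleep, 0.01)  # waits behind tablet-a
    pool.shutdown()
    for name, queued, ran in pool.samples:
        print(f"{name}: queued {queued:.3f}s, ran {ran:.3f}s")
```

With a single worker, tablet-b's queue time is roughly tablet-a's run time. 
That is why both numbers matter: a long queue time points at the pool being 
saturated, while a long run time points at the tablet itself.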

A final important point here is that startup problems are hard to reproduce: 
they require restarting. Slow scans or writes can usually be reproduced 
whenever, and so examined with /stacks, tracing, metrics, etc. much more easily. 
So maybe excessive logging during startup is justified so that there's evidence 
to bring to experts.

Anyway, these were just some off-the-cuff thoughts. Let me know what you think.



--
To view, visit http://gerrit.cloudera.org:8080/12540
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3793a2385612cf920a94e5f62a559c350b8bf461
Gerrit-Change-Number: 12540
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Will Berkeley <[email protected]>
Gerrit-Comment-Date: Thu, 21 Feb 2019 01:45:30 +0000
Gerrit-HasComments: Yes
