Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/21316 )
Change subject: [metrics] Add metrics for create and delete op time
......................................................................


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/21316/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21316/1//COMMIT_MSG@11
PS1, Line 11: These monitoring metrics will be very helpful for analyzing
            : issues related to high CPU usage.
> Thank you very much for your response.

Yes: in general that makes sense, of course. I was just trying to say that creating/deleting a tablet replica mostly involves disk IO rather than many CPU cycles. Adding metrics here and there might be a guessing game.

If the goal is to spot CPU bottlenecks, I can also recommend using built-in tracing: https://kudu.apache.org/docs/troubleshooting.html#kudu_tracing

In addition, running 'htop -p <kudu_tserver_pid>', using strace, etc. could pinpoint particular threads that consume a lot of CPU.


http://gerrit.cloudera.org:8080/#/c/21316/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21316/4//COMMIT_MSG@11
PS4, Line 11: analyzing
            : issues related to high CPU usage
How could the stats on the duration of creating/deleting a tablet replica help with analyzing high CPU usage scenarios?

FWIW, most of the activity while creating/deleting a tablet is attributed to disk IO. If you are interested in tracking CPU usage of some activity (not attributed to IO wait times), I could recommend taking a look at the 'reactor_load_percent' metric.

I hope this helps.


http://gerrit.cloudera.org:8080/#/c/21316/4/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/21316/4/src/kudu/tserver/ts_tablet_manager.cc@273
PS4, Line 273:
What is the significance of this 'on the current node' part? All the tablet metrics are attributed to the node where the tablet replica is hosted, no? If so, maybe drop this part?

http://gerrit.cloudera.org:8080/#/c/21316/4/src/kudu/tserver/ts_tablet_manager.cc@274
PS4, Line 274:
Why kInfo, not kDebug? Looking at metrics like 'tablets_opening_time_startup', it seems this sort of metric is something that would be used mostly for troubleshooting.


http://gerrit.cloudera.org:8080/#/c/21316/4/src/kudu/tserver/ts_tablet_manager.cc@281
PS4, Line 281:
ditto: maybe, kDebug is a better choice here?


http://gerrit.cloudera.org:8080/#/c/21316/4/src/kudu/tserver/ts_tablet_manager.cc@1168
PS4, Line 1168:
Shouldn't the delete_tablet_run_time_ metric be updated before return here as well?


--
To view, visit http://gerrit.cloudera.org:8080/21316
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bd52013caa94a33143cb16ff3831a49b74bac4
Gerrit-Change-Number: 21316
Gerrit-PatchSet: 1
Gerrit-Owner: KeDeng <kdeng...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <ale...@apache.org>
Gerrit-Reviewer: KeDeng <kdeng...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 07 May 2024 07:26:16 +0000
Gerrit-HasComments: Yes
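
Regarding the comment on line 1168 of ts_tablet_manager.cc above (updating delete_tablet_run_time_ before an early return): one possible way to cover every return path is to record the duration from a scoped object's destructor. The following is only a minimal sketch under stated assumptions. DurationHistogram, ScopedDurationRecorder and DeleteTabletSketch are hypothetical names made up for illustration. They are not Kudu's metrics API and not the code in the patch.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a duration histogram metric. It only stores the
// raw samples so that the pattern can be shown without any Kudu headers.
struct DurationHistogram {
  std::vector<int64_t> samples_us;
  void Increment(int64_t value_us) { samples_us.push_back(value_us); }
};

// Records the elapsed wall-clock time, in microseconds, into the histogram
// when the object goes out of scope. Because the recording happens in the
// destructor, it runs on every return path of the enclosing function.
class ScopedDurationRecorder {
 public:
  explicit ScopedDurationRecorder(DurationHistogram* hist)
      : hist_(hist), start_(std::chrono::steady_clock::now()) {
    assert(hist != nullptr);
  }

  ~ScopedDurationRecorder() {
    const auto elapsed = std::chrono::steady_clock::now() - start_;
    hist_->Increment(
        std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count());
  }

  ScopedDurationRecorder(const ScopedDurationRecorder&) = delete;
  ScopedDurationRecorder& operator=(const ScopedDurationRecorder&) = delete;

 private:
  DurationHistogram* hist_;
  std::chrono::steady_clock::time_point start_;
};

// Hypothetical delete routine with an early return, loosely modeled on the
// situation the review comment describes.
bool DeleteTabletSketch(bool already_deleted,
                        DurationHistogram* delete_run_time) {
  ScopedDurationRecorder recorder(delete_run_time);
  if (already_deleted) {
    return true;  // Early return: the destructor still records a sample.
  }
  // ... the actual deletion work would go here ...
  return true;
}
```

With this shape, both the early-return path and the normal path add exactly one sample to the histogram, so no explicit metric update is needed before each return statement.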