Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8090 )
Change subject: KUDU-1125: issue one catalog write per tablet report ...................................................................... Patch Set 7: Code-Review+1 (4 comments) lgtm after you address Mike's comments http://gerrit.cloudera.org:8080/#/c/8090/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8090/6//COMMIT_MSG@10 PS6, Line 10: When the : master is configured to fsync WAL writes, this can add a lot of load during : election storms and when the master is restarted. > Haven't done any perf testing yet. I was planning to do it post-merge, but That's fair. If it's going to take a lot of effort I don't think it's worth trying to hit this issue. It might also take a lot of machines since we won't generate writes for TOMBSTONED tablets, and the non-tombstoned tablet count is limited by the thread scalability issues on the TS side. I'd skip doing anything fancy for now and just try to enable fsync on a cluster with a reasonable number of tablets and see if there's a noticeable difference in the TSHeartbeat latency profile (there should be an enormous one) http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3106 PS6, Line 3106: "requestor", rpc->requestor_string(), > Todd, I'm not sure how your example can help measure latency. Did you mean well, /rpcz already keeps samples based on the overall latency of the request handling. For each sampling bucket (eg >1sec) it'll keep a single sampled trace, which includes the metrics. So the idea is that if we go to /rpcz and look at the ">1sec TSHeartbeat" sample, we'll have a useful metric to see whether the report was particularly large or something. http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3285 PS6, Line 3285: // leader elected. We need to wait for a leader before marking a tablet : // as RUNNING, or else > This TODO is trying to draw attention to the fact that ShouldTransitionTabl Ack http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3173 PS7, Line 3173: // reasonable to ignore it and wait for an operator fix the situation. > The end of this comment only makes sense after going to the flag definition I had a similar thought but seems out of scope for this commit. Perhaps add a TODO to consider removing this flag? Seems like it's only used by one test, and that test's goal is just to see that the flag works :) -- To view, visit http://gerrit.cloudera.org:8080/8090 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b Gerrit-Change-Number: 8090 Gerrit-PatchSet: 7 Gerrit-Owner: Adar Dembo <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Dan Burkert <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Comment-Date: Fri, 06 Oct 2017 00:36:57 +0000 Gerrit-HasComments: Yes
