Song Jiacheng has posted comments on this change. ( http://gerrit.cloudera.org:8080/20166 )
Change subject: KUDU-3407 Avoid unchecked scheduling of flush operations. ...................................................................... Patch Set 31: (16 comments) Thank you for the review and your attention for such a long time. http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@11 PS29, Line 11: wal > nit: WAL Done http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@12 PS29, Line 12: eventually break : due to OOM > I'm curious what do you think was the actual reason behind that OOM conditi Yes, normally tablet servers reject write requests if their memory usage reach 80%, but sometimes the usage might be higher than 100% and got killed by system(The system memory runs out). I think the reason is that the running memory is getting higher because of the worse performance. I could always find that the running memory would be high if the cluster is lack of CPU or I/O. I think their root cause are the same--read/write operation takes longer. http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@16 PS29, Line 16: server is under memory pressure > By your experience with applying this functionality in a real cluster, woul Not really actually, this patch has been applied in over 100 clusters for about a year and the value run_non_memory_ops_prob is set to 0.2, I did not find any OOM because of this. The probability of running compactions is decreasing as the memory usage getting higher. So actually the behaviors while the memory usage is higher than 80% are the same. And with better performance, read/write operation will not take too long even if under memory pressure. http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc File src/kudu/util/maintenance_manager-test.cc: http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@314 PS29, Line 314: > nit: drop 'that' Done http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@316 PS29, Line 316: y times the operat > nit: usually it's called queue time or something, meaning how much time an Done http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@318 PS29, Line 318: e oper > nit: maybe, name this 'run_count_'? Done. http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@389 PS29, Line 389: > I guess it makes sense to rename this field since its semantics is differen Done http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@982 PS29, Line 982: tMaintenanceOp op1("perf_op", MaintenanceOp::HIGH_IO_USAGE) > What is the probability of actually choosing a memory flushing op here? In This test does not make too much sense for now. The test below is actually covering it, so delete it. http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@985 PS29, Line 985: p_time(MonoDelta::FromMilliseconds(5)); > nit for here and elsewhere for ASSERT_EQ(): the expected value comes first Done http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@998 PS29, Line 998: ister_self(true); > Even with probabilistic behavior, isn't it always possible at least to chec Done http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@999 PS29, Line 999: > Given this test scenario runs way over 3 seconds, please add SKIP_IF_SLOW_ Done http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@1006 PS29, Line 1006: FLAGS_enable_maintenance_manager = true; : // Wait for the memory_op to run over 1000 times and then check the running times : // of other operations. > Why to output this if all these variables aren't even changing at this poin I thought the test is provided to the users to set different parameters and run multi times, done. http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@1035 PS29, Line 1035: O) << Substitute(" > What is 'MaintenanceMgr num'? It is the thread name of maintenance manager. I will make it clear. http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@1036 PS29, Line 1036: > Wrap this with NO_FATALS() ? Done http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@1044 PS29, Line 1044: > Why to wait so long? Would be great to comment on this. Alternatively, ma Done http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@1046 PS29, Line 1046: : > Is there a way to check for the number of operations still running or opera 50ms is enough, which is also accepted, I think.. -- To view, visit http://gerrit.cloudera.org:8080/20166 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idc2fd3a850cf99d54ef2980211b712468440ed80 Gerrit-Change-Number: 20166 Gerrit-PatchSet: 31 Gerrit-Owner: Song Jiacheng <[email protected]> Gerrit-Reviewer: Abhishek Chennaka <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Ashwani Raina <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Song Jiacheng <[email protected]> Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Wang Xixu <[email protected]> Gerrit-Comment-Date: Fri, 20 Oct 2023 06:28:59 +0000 Gerrit-HasComments: Yes
