Song Jiacheng has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20166 )

Change subject: KUDU-3407 Avoid unchecked scheduling of flush operations.
......................................................................


Patch Set 31:

(16 comments)

Thank you for the review and your attention for such a long time.

http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@11
PS29, Line 11: wal
> nit: WAL
Done


http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@12
PS29, Line 12: eventually break
             : due to OOM
> I'm curious what do you think was the actual reason behind that OOM conditi
Yes, normally tablet servers reject write requests if their memory usage reach 
80%, but sometimes the usage might be higher than 100% and got killed by 
system(The system memory runs out).
I think the reason is that the running memory is getting higher because of the 
worse performance. I could always find that the running memory would be high if 
the cluster is lack of CPU or I/O. I think their root cause are the 
same--read/write operation takes longer.


http://gerrit.cloudera.org:8080/#/c/20166/29//COMMIT_MSG@16
PS29, Line 16: server is under memory pressure
> By your experience with applying this functionality in a real cluster, woul
Not really actually, this patch has been applied in over 100 clusters for about 
a year and the value run_non_memory_ops_prob is set to 0.2, I did not find any 
OOM because of this.
The probability of running compactions is decreasing as the memory usage 
getting higher. So actually the behaviors while the memory usage is higher than 
80% are the same. And with better performance, read/write operation will not 
take too long even if under memory pressure.


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc
File src/kudu/util/maintenance_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@314
PS29, Line 314:
> nit: drop 'that'
Done


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@316
PS29, Line 316: y times the operat
> nit: usually it's called queue time or something, meaning how much time an
Done


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@318
PS29, Line 318: e oper
> nit: maybe, name this 'run_count_'?
Done.


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@389
PS29, Line 389:
> I guess it makes sense to rename this field since its semantics is differen
Done


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@982
PS29, Line 982: tMaintenanceOp op1("perf_op", MaintenanceOp::HIGH_IO_USAGE)
> What is the probability of actually choosing a memory flushing op here?  In
This test does not make too much sense for now. The test below is actually 
covering it, so delete it.


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@985
PS29, Line 985: p_time(MonoDelta::FromMilliseconds(5));
> nit for here and elsewhere for ASSERT_EQ(): the expected value comes first
Done


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@998
PS29, Line 998: ister_self(true);
> Even with probabilistic behavior, isn't it always possible at least to chec
Done


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@999
PS29, Line 999:
> Given this test scenario runs way over 3 seconds, please add  SKIP_IF_SLOW_
Done


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@1006
PS29, Line 1006:   FLAGS_enable_maintenance_manager = true;
               :   // Wait for the memory_op to run over 1000 times and then 
check the running times
               :   // of other operations.
> Why to output this if all these variables aren't even changing at this poin
I thought the test is provided to the users to set different parameters and run 
multi times, done.


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@1035
PS29, Line 1035: O) << Substitute("
> What is 'MaintenanceMgr num'?
It is the thread name of maintenance manager. I will make it clear.


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@1036
PS29, Line 1036:               
> Wrap this with NO_FATALS() ?
Done


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@1044
PS29, Line 1044:
> Why to wait so long?  Would be great to comment on this.  Alternatively, ma
Done


http://gerrit.cloudera.org:8080/#/c/20166/29/src/kudu/util/maintenance_manager-test.cc@1046
PS29, Line 1046:
               :
> Is there a way to check for the number of operations still running or opera
50ms is enough, which is also accepted, I think..



--
To view, visit http://gerrit.cloudera.org:8080/20166
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idc2fd3a850cf99d54ef2980211b712468440ed80
Gerrit-Change-Number: 20166
Gerrit-PatchSet: 31
Gerrit-Owner: Song Jiacheng <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Song Jiacheng <[email protected]>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wang Xixu <[email protected]>
Gerrit-Comment-Date: Fri, 20 Oct 2023 06:28:59 +0000
Gerrit-HasComments: Yes

Reply via email to