[ https://issues.apache.org/jira/browse/RATIS-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764826#comment-17764826 ]
Wei-Chiu Chuang commented on RATIS-1886:
----------------------------------------
I'm seeing the same issue with Ozone hflush API latency.
Using the Ozone Freon tool to hflush 32 files concurrently, the median
latency improves by ~3x (from 4303.67 ms to 1245.10 ms) after reducing
raft.server.log.appender.wait-time.min to 1ms.
{noformat}
ozone freon dfsg --buffer=1024 --copy-buffer=1024 -n 1000 --path ofs://ozone1/vol1/bucket1/dfsg -s 10240 -t 32 -sync hflush
{noformat}
Before:
{noformat}
file-create
count = 1000
mean rate = 7.78 calls/second
1-minute rate = 7.11 calls/second
5-minute rate = 2.68 calls/second
15-minute rate = 1.01 calls/second
min = 325.44 milliseconds
max = 7368.44 milliseconds
mean = 3998.90 milliseconds
stddev = 2182.79 milliseconds
median = 4303.67 milliseconds
75% <= 6207.01 milliseconds
95% <= 6848.42 milliseconds
98% <= 7191.24 milliseconds
99% <= 7288.67 milliseconds
99.9% <= 7356.27 milliseconds
{noformat}
After:
{noformat}
file-create
count = 1000
mean rate = 24.42 calls/second
1-minute rate = 22.75 calls/second
5-minute rate = 21.44 calls/second
15-minute rate = 21.15 calls/second
min = 157.16 milliseconds
max = 2459.12 milliseconds
mean = 1282.02 milliseconds
stddev = 850.42 milliseconds
median = 1245.10 milliseconds
75% <= 2169.65 milliseconds
95% <= 2296.89 milliseconds
98% <= 2361.82 milliseconds
99% <= 2404.80 milliseconds
99.9% <= 2459.12 milliseconds
{noformat}
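The improvement factors can be recomputed directly from the two reports above. This small Java program is only an illustration (the class and method names are made up, not part of Freon); it divides each before/after latency statistic and compares the mean call rates.

```java
// Illustrative helper, not part of Ozone Freon: recomputes improvement
// factors from the Before/After figures in the two reports.
class FreonSpeedup {
    // Labels for the latency statistics compared below.
    static final String[] LABELS = {"median", "mean", "p99"};

    // {before, after} latencies in milliseconds, copied from the reports.
    static final double[][] LATENCY_MS = {
        {4303.67, 1245.10}, // median
        {3998.90, 1282.02}, // mean
        {7288.67, 2404.80}, // 99%
    };

    // How many times lower the "after" latency is than the "before" latency.
    static double latencyImprovement(double beforeMs, double afterMs) {
        return beforeMs / afterMs;
    }

    // How many times higher the "after" call rate is than the "before" rate.
    static double throughputImprovement(double beforeRate, double afterRate) {
        return afterRate / beforeRate;
    }

    public static void main(String[] args) {
        for (int i = 0; i < LABELS.length; i++) {
            double factor = latencyImprovement(LATENCY_MS[i][0], LATENCY_MS[i][1]);
            System.out.printf("%s latency: %.2fx lower%n", LABELS[i], factor);
        }
        // Mean rates in calls/second from the two reports.
        System.out.printf("mean rate: %.2fx higher%n", throughputImprovement(7.78, 24.42));
    }
}
```

With these inputs the factors come out at about 3.46 (median), 3.12 (mean), 3.03 (99th percentile) and 3.14 (mean call rate), so "~3x" is, if anything, slightly conservative.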
> AppendLog sleep fixed time cause significant drop in write throughput
> ---------------------------------------------------------------------
>
> Key: RATIS-1886
> URL: https://issues.apache.org/jira/browse/RATIS-1886
> Project: Ratis
> Issue Type: Improvement
> Components: server
> Affects Versions: 2.5.1
> Reporter: Yaolong Liu
> Priority: Major
> Attachments: image-2023-09-13-15-44-00-933.png
>
>
> In https://issues.apache.org/jira/browse/RATIS-1793, we enforced
> raft.server.log.appender.wait-time.min, which makes GrpcLogAppender sleep
> for a fixed time during appendLog. This cuts Alluxio master write throughput
> by 50%, which is unacceptable. The Alluxio master ops can be seen below:
> !image-2023-09-13-15-44-00-933.png!
> I noticed that this patch was introduced to keep the leader from becoming too
> busy under some error conditions. Could we introduce the sleep only when an
> error is detected (maybe not easy), or find a way to locate the error condition
> and fix it completely? The performance degradation caused by sleeping on every
> appendLog request may be underestimated.
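To make the proposal concrete, here is a minimal, hypothetical Java sketch contrasting the two wait policies. It is not the GrpcLogAppender code: the class, its methods and the simplified model (each appendLog request is preceded by its wait, and requests are sent one at a time) are assumptions for illustration. Under that model, a fixed wait w caps the request rate at 1 / (rtt + w) even when nothing is wrong, while an error-only wait costs nothing on the healthy path.

```java
import java.time.Duration;

// Hypothetical model of an appender's wait policy; NOT the Ratis GrpcLogAppender code.
class AppenderWaitPolicy {

    // Fixed policy (as described for RATIS-1793): wait waitTimeMin before every
    // appendLog request, whether or not the previous one failed.
    static Duration fixedWait(Duration waitTimeMin, boolean previousFailed) {
        return waitTimeMin;
    }

    // Error-only policy (the proposal above): wait only after a failure.
    static Duration errorOnlyWait(Duration waitTimeMin, boolean previousFailed) {
        return previousFailed ? waitTimeMin : Duration.ZERO;
    }

    // Upper bound on requests per second when requests are sent one at a time,
    // each taking rtt to complete and preceded by the given wait.
    static double maxRequestsPerSecond(Duration rtt, Duration wait) {
        long nanosPerRequest = rtt.plus(wait).toNanos();
        return 1_000_000_000.0 / nanosPerRequest;
    }

    public static void main(String[] args) {
        Duration rtt = Duration.ofMillis(1);      // assumed round-trip time
        Duration waitMin = Duration.ofMillis(10); // assumed wait-time.min value
        double fixed = maxRequestsPerSecond(rtt, fixedWait(waitMin, false));
        double errorOnly = maxRequestsPerSecond(rtt, errorOnlyWait(waitMin, false));
        System.out.printf("healthy path, fixed wait:      %.1f req/s%n", fixed);
        System.out.printf("healthy path, error-only wait: %.1f req/s%n", errorOnly);
    }
}
```

If the real appender keeps several requests in flight, the 1 / (rtt + w) bound is only a rough intuition for why a per-request sleep hurts; the Freon and Alluxio measurements in this issue are the actual evidence.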
--
This message was sent by Atlassian Jira
(v8.20.10#820010)