[
https://issues.apache.org/jira/browse/RATIS-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971084#comment-16971084
]
Clay B. edited comment on RATIS-695 at 11/10/19 9:53 AM:
---------------------------------------------------------
I found a neat library, [Failsafe|http://jodah.net/failsafe], which offers a
cleanish API for doing retries (as well as circuit breakers and timeouts). I
hacked retries into lots of I/O points but covered neither the Ratis log
service nor snapshots. (I would need tests for these in the Vagrant process.)
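For anyone who wants the general shape of the retry wrapping without reading the patch, here is a minimal sketch in plain Java. It is not the attached patch and it does not use the Failsafe API; the class {{IoRetrySketch}}, the interface {{IoOperation}}, and the parameters {{maxRetries}} and {{delayMillis}} are hypothetical names chosen for illustration. It only shows the idea of re-running an operation when it throws an {{IOException}}:

```java
import java.io.IOException;
import java.io.InterruptedIOException;

/** Hypothetical helper for illustration; not part of Ratis or Failsafe. */
final class IoRetrySketch {
  private IoRetrySketch() {}

  /** An operation that may fail with an IOException (hypothetical interface). */
  interface IoOperation<T> {
    T run() throws IOException;
  }

  /**
   * Runs op once and, if it throws IOException, retries it up to maxRetries
   * more times, sleeping delayMillis between attempts. The last IOException
   * is rethrown once the retries are exhausted.
   */
  static <T> T callWithRetries(IoOperation<T> op, int maxRetries, long delayMillis)
      throws IOException {
    if (maxRetries < 0 || delayMillis < 0) {
      throw new IllegalArgumentException("maxRetries and delayMillis must be >= 0");
    }
    for (int attempt = 0; ; attempt++) {
      try {
        return op.run();
      } catch (IOException e) {
        if (attempt >= maxRetries) {
          throw e; // out of retries: surface the failure to the caller
        }
        System.err.println("WARN Retrying operation: " + e);
        try {
          Thread.sleep(delayMillis);
        } catch (InterruptedException ie) {
          // Preserve the interrupt status and stop retrying.
          Thread.currentThread().interrupt();
          InterruptedIOException iioe = new InterruptedIOException("interrupted during retry");
          iioe.initCause(ie);
          throw iioe;
        }
      }
    }
  }
}
```

A caller would wrap a single write, e.g. {{IoRetrySketch.callWithRetries(() -> Files.write(path, bytes), 3, 10L)}}; the retry count and delay there are arbitrary values, not ones taken from the patch.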
However, with the Vagrant-based arithmetic tests and an md5sum run afterwards
on the log directories, I see all three servers writing the same data out even
at a 50.1% {{EIO}} injection rate. Leader flapping becomes really, really bad
at 60% (because the I/O retries take so long that we can't converge before the
default Ratis Raft timeout), though all three logs were still consistent at
that level after a minute or two on a modern MacBook.
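As a rough way to reason about how retries stretch out as the injection rate climbs, the sketch below assumes each I/O attempt fails independently with probability p. That is a simplifying assumption; the injected failures may not behave this way. Under it, the number of attempts per operation is geometric with mean 1 / (1 - p). {{RetryMath}} and its methods are hypothetical names:

```java
/** Hypothetical helper for back-of-the-envelope retry estimates. */
final class RetryMath {
  private RetryMath() {}

  /**
   * Mean number of attempts until the first success, assuming each attempt
   * fails independently with probability p (0 <= p < 1).
   */
  static double expectedAttempts(double p) {
    if (p < 0.0 || p >= 1.0) {
      throw new IllegalArgumentException("p must be in [0, 1)");
    }
    return 1.0 / (1.0 - p);
  }

  /** Probability that the first n attempts all fail, under the same assumption. */
  static double probAllFail(double p, int n) {
    if (p < 0.0 || p > 1.0 || n < 0) {
      throw new IllegalArgumentException("need 0 <= p <= 1 and n >= 0");
    }
    return Math.pow(p, n);
  }
}
```

Under this model the mean goes from roughly 2 attempts per operation at a 50.1% failure rate to 2.5 at 60%. It says nothing about how long each attempt or retry delay takes, so it is only a starting point for comparing retry time against the Raft timeout.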
Example output after a few runs with a 50.1% failure rate using the attached
patch:
{code:java}
Total files written: 100
Each files size: 1048576
Total data written: 104857600 bytes
Total time taken: 3396 millis
Verification of all Ratis file server logs have the same checksum across all storage directories:
300 07bf94afba7ac52bdcd1e8c954a4edd1
300 0ed744997cf39dbac7aa4422239bcd40
300 3a249568b521845516d0aaab82761880
300 97b8b419fcd672ac1fda9b98c3d35a36
300 f874626c6b816533e619e84ed021d264
=== Command terminated normally (Sun Nov 10 09:41:47 2019) ===
{code}
I didn't see any unhandled {{IOExceptions}} after these runs either (211
exceptions causing 211 retries):
{code:java}
vagrant@ratis-hdd-slowdown:~$ egrep --no-group-separator -B1 'FileSystemException|java.io.IOException' server_n0.log | sed 's/2019-[^ ]* 09:[^ ]*//' | sort | uniq -c | sort -n
1 WARN SimpleStateMachineStorage:58 - Retrying operation:
1 java.nio.file.FileSystemException: /home/vagrant/test_data/data0_slowed/64656d6f-5261-6674-4772-6f7570313233/sm: Input/output error
4 WARN RaftStorageDirectory:53 - Retrying operation:
4 java.nio.file.FileSystemException: /home/vagrant/test_data/data0_slowed/64656d6f-5261-6674-4772-6f7570313233/current: Input/output error
56 WARN SegmentedRaftLogOutputStream:48 - Retrying operation:
150 WARN AtomicFileOutputStream:56 - Retrying:
206 java.io.IOException: Input/output error
{code}
was (Author: clayb):
I found a neat library, [Failsafe|http://jodah.net/failsafe], which offers a
cleanish API for doing retries (as well as circuit breakers and timeouts). I
hacked retries into lots of I/O points but covered neither the Ratis state
machine nor snapshots. (I would need tests for these in the Vagrant process.)
However, with the Vagrant-based arithmetic tests and an md5sum run afterwards
on the log directories, I see all three servers writing the same data out even
at a 50.1% {{EIO}} injection rate. Leader flapping becomes really, really bad
at 60% (because the I/O retries take so long that we can't converge before the
default Ratis Raft timeout), though all three logs were still consistent at
that level after a minute or two on a modern MacBook.
Example output after a few runs with a 50.1% failure rate using the attached
patch:
{code:java}
Total files written: 100
Each files size: 1048576
Total data written: 104857600 bytes
Total time taken: 3396 millis
Verification of all Ratis file server logs have the same checksum across all storage directories:
300 07bf94afba7ac52bdcd1e8c954a4edd1
300 0ed744997cf39dbac7aa4422239bcd40
300 3a249568b521845516d0aaab82761880
300 97b8b419fcd672ac1fda9b98c3d35a36
300 f874626c6b816533e619e84ed021d264
=== Command terminated normally (Sun Nov 10 09:41:47 2019) ===
{code}
I didn't see any unhandled {{IOExceptions}} after these runs either (211
exceptions causing 211 retries):
{code:java}
vagrant@ratis-hdd-slowdown:~$ egrep --no-group-separator -B1 'FileSystemException|java.io.IOException' server_n0.log | sed 's/2019-[^ ]* 09:[^ ]*//' | sort | uniq -c | sort -n
1 WARN SimpleStateMachineStorage:58 - Retrying operation:
1 java.nio.file.FileSystemException: /home/vagrant/test_data/data0_slowed/64656d6f-5261-6674-4772-6f7570313233/sm: Input/output error
4 WARN RaftStorageDirectory:53 - Retrying operation:
4 java.nio.file.FileSystemException: /home/vagrant/test_data/data0_slowed/64656d6f-5261-6674-4772-6f7570313233/current: Input/output error
56 WARN SegmentedRaftLogOutputStream:48 - Retrying operation:
150 WARN AtomicFileOutputStream:56 - Retrying:
206 java.io.IOException: Input/output error
{code}
> Improve running in the face of flakey disks
> -------------------------------------------
>
> Key: RATIS-695
> URL: https://issues.apache.org/jira/browse/RATIS-695
> Project: Ratis
> Issue Type: Improvement
> Components: server
> Reporter: Clay B.
> Priority: Minor
> Labels: namazu
> Attachments: 0001-RATIS-695.-SimpleStateMachineStorage.java.patch
>
>
> In testing with
> [Namazu|https://github.com/apache/incubator-ratis/blob/35838f032a4096d78843130fa1435bcddf5ce961/dev-support/vagrant/README.md#ratis-hdd-slowdown-vm]
> disk paths which fail in the face of {{IOException}}s were found. This
> umbrella JIRA is to track the code paths found that need hardening. These
> failures seem to be fatal to the Ratis server performing actions but do
> not cause the server to abort.