[jira] [Comment Edited] (RATIS-695) Improve running in the face of flakey disks

Clay B. (Jira) Sun, 10 Nov 2019 01:54:24 -0800


    [ 
https://issues.apache.org/jira/browse/RATIS-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971084#comment-16971084
 ]


Clay B. edited comment on RATIS-695 at 11/10/19 9:53 AM:
---------------------------------------------------------

I found a neat library, [Failsafe|http://jodah.net/failsafe], which offers a 
fairly clean API for doing retries (and circuit breakers and timeouts). I hacked 
retries into lots of I/O points but didn't cover the Ratis log service or 
snapshots. (I would need tests for these in the Vagrant process.)
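
For illustration only, here is a minimal sketch of the retry-on-{{IOException}} idea in plain Java. It is not the attached patch and does not use the Failsafe API; the class {{RetrySketch}}, the method {{withRetries}} and its parameters are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not part of Ratis or Failsafe. */
final class RetrySketch {
  private RetrySketch() {}

  /**
   * Runs op and retries it when it throws an IOException
   * (java.nio.file.FileSystemException is a subclass), making at most
   * maxAttempts attempts and sleeping delayMillis between attempts.
   * Any other exception is propagated immediately.
   */
  static <T> T withRetries(Callable<T> op, int maxAttempts, long delayMillis)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.call();
      } catch (IOException e) {
        last = e;
        if (attempt < maxAttempts) {
          // Log only when another attempt will actually be made.
          System.err.println("Retrying operation: " + e);
          Thread.sleep(delayMillis);
        }
      }
    }
    // Every attempt failed: surface the last I/O error to the caller.
    throw last;
  }
}
```

A caller would wrap a single I/O step, for example {{RetrySketch.withRetries(() -> Files.readAllBytes(path), 10, 50L)}}. The attempt cap matters because an unbounded retry loop can stall a server past its election timeout.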

However, with the Vagrant-based arithmetic tests and an md5sum run afterwards 
over the log directories, I see all three servers writing the same data out even 
at a 50.1% {{EIO}} injection rate. Flapping of the leader becomes really, really 
bad at 60% (because the I/O retries take so long that we can't converge before 
the default Ratis Raft timeout), though all three logs were consistent at that 
level too after a minute or two on a modern MacBook.
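
A rough way to reason about the retry cost: assuming each I/O attempt fails independently with probability p (an assumption; the injected failures may not be independent), the number of attempts until success is geometric. The mean is then 1/(1-p), which is 2.5 attempts at p = 0.60 and about 2.0 at p = 0.501, and the chance that one operation needs more than k attempts is p^k. The sketch below just evaluates those two formulas; the class and method names are hypothetical.

```java
/** Hypothetical helper for illustration; models retries as independent trials. */
final class RetryMath {
  private RetryMath() {}

  /** Mean number of attempts until the first success when each attempt fails with probability p. */
  static double expectedAttempts(double p) {
    return 1.0 / (1.0 - p);
  }

  /** Probability that a single operation needs more than k attempts. */
  static double probMoreThan(double p, int k) {
    return Math.pow(p, k);
  }

  public static void main(String[] args) {
    // The two injection rates mentioned in the comment above.
    for (double p : new double[] {0.501, 0.60}) {
      System.out.printf(
          "p=%.3f expected attempts=%.3f P(more than 5 attempts)=%.4f%n",
          p, expectedAttempts(p), probMoreThan(p, 5));
    }
  }
}
```

The mean grows slowly, but the tail grows faster: long retry runs on a single write are more likely at the higher rate, and those runs are what would delay a server past a timeout.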

Example output after a few runs with a 50.1% failure rate using the attached 
patch:
{code:java}
Total files written: 100
Each files size: 1048576
Total data written: 104857600 bytes
Total time taken: 3396 millis
Verification of all Ratis file server logs have the same checksum across all storage directories:
    300 07bf94afba7ac52bdcd1e8c954a4edd1
    300 0ed744997cf39dbac7aa4422239bcd40
    300 3a249568b521845516d0aaab82761880
    300 97b8b419fcd672ac1fda9b98c3d35a36
    300 f874626c6b816533e619e84ed021d264
=== Command terminated normally (Sun Nov 10 09:41:47 2019) ===
{code}
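
The script that produced the checksum tally above is not shown here. As a hedged sketch of the same kind of check, the Java below computes the MD5 of every regular file under a set of storage directories and counts how often each digest occurs, similar in spirit to {{md5sum ... | sort | uniq -c}}. The class {{ChecksumTally}} and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not part of the Ratis test scripts. */
final class ChecksumTally {
  private ChecksumTally() {}

  /** Returns the lowercase hex MD5 digest of one file, read in 8 KiB chunks. */
  static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    try (InputStream in = Files.newInputStream(file)) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        md.update(buf, 0, n);
      }
    }
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest()) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  /** Counts occurrences of each digest over all regular files under the given roots. */
  static Map<String, Integer> tally(List<Path> roots)
      throws IOException, NoSuchAlgorithmException {
    Map<String, Integer> counts = new TreeMap<>();
    for (Path root : roots) {
      List<Path> files;
      try (Stream<Path> walk = Files.walk(root)) {
        files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
      }
      for (Path f : files) {
        counts.merge(md5Hex(f), 1, Integer::sum);
      }
    }
    return counts;
  }
}
```

If every server wrote identical data, each digest's count should be a multiple of the number of storage directories compared; a digest that appears fewer times points at a divergent file.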
I didn't see any unhandled {{IOException}}s after these runs either (211 
exceptions causing 211 retries):
{code:java}
vagrant@ratis-hdd-slowdown:~$ egrep --no-group-separator -B1 'FileSystemException|java.io.IOException' server_n0.log | sed 's/2019-[^ ]* 09:[^ ]*//' | sort | uniq -c | sort -n
      1  WARN  SimpleStateMachineStorage:58 - Retrying operation:
      1 java.nio.file.FileSystemException: /home/vagrant/test_data/data0_slowed/64656d6f-5261-6674-4772-6f7570313233/sm: Input/output error
      4  WARN  RaftStorageDirectory:53 - Retrying operation:
      4 java.nio.file.FileSystemException: /home/vagrant/test_data/data0_slowed/64656d6f-5261-6674-4772-6f7570313233/current: Input/output error
     56  WARN  SegmentedRaftLogOutputStream:48 - Retrying operation:
    150  WARN  AtomicFileOutputStream:56 - Retrying:
    206 java.io.IOException: Input/output error{code}


was (Author: clayb):
I found a neat library, [Failsafe|http://jodah.net/failsafe], which offers a 
fairly clean API for doing retries (and circuit breakers and timeouts). I hacked 
retries into lots of I/O points but didn't cover the Ratis state machine or 
snapshots. (I would need tests for these in the Vagrant process.)

However, with the Vagrant-based arithmetic tests and an md5sum run afterwards 
over the log directories, I see all three servers writing the same data out even 
at a 50.1% {{EIO}} injection rate. Flapping of the leader becomes really, really 
bad at 60% (because the I/O retries take so long that we can't converge before 
the default Ratis Raft timeout), though all three logs were consistent at that 
level too after a minute or two on a modern MacBook.

Example output after a few runs with a 50.1% failure rate using the attached 
patch:
{code:java}
Total files written: 100
Each files size: 1048576
Total data written: 104857600 bytes
Total time taken: 3396 millis
Verification of all Ratis file server logs have the same checksum across all storage directories:
    300 07bf94afba7ac52bdcd1e8c954a4edd1
    300 0ed744997cf39dbac7aa4422239bcd40
    300 3a249568b521845516d0aaab82761880
    300 97b8b419fcd672ac1fda9b98c3d35a36
    300 f874626c6b816533e619e84ed021d264
=== Command terminated normally (Sun Nov 10 09:41:47 2019) ===
{code}
I didn't see any unhandled {{IOException}}s after these runs either (211 
exceptions causing 211 retries):
{code:java}
vagrant@ratis-hdd-slowdown:~$ egrep --no-group-separator -B1 'FileSystemException|java.io.IOException' server_n0.log | sed 's/2019-[^ ]* 09:[^ ]*//' | sort | uniq -c | sort -n
      1  WARN  SimpleStateMachineStorage:58 - Retrying operation:
      1 java.nio.file.FileSystemException: /home/vagrant/test_data/data0_slowed/64656d6f-5261-6674-4772-6f7570313233/sm: Input/output error
      4  WARN  RaftStorageDirectory:53 - Retrying operation:
      4 java.nio.file.FileSystemException: /home/vagrant/test_data/data0_slowed/64656d6f-5261-6674-4772-6f7570313233/current: Input/output error
     56  WARN  SegmentedRaftLogOutputStream:48 - Retrying operation:
    150  WARN  AtomicFileOutputStream:56 - Retrying:
    206 java.io.IOException: Input/output error{code}

> Improve running in the face of flakey disks
> -------------------------------------------
>
>                 Key: RATIS-695
>                 URL: https://issues.apache.org/jira/browse/RATIS-695
>             Project: Ratis
>          Issue Type: Improvement
>          Components: server
>            Reporter: Clay B.
>            Priority: Minor
>              Labels: namazu
>         Attachments: 0001-RATIS-695.-SimpleStateMachineStorage.java.patch
>
>
> In testing with 
> [Namazu|https://github.com/apache/incubator-ratis/blob/35838f032a4096d78843130fa1435bcddf5ce961/dev-support/vagrant/README.md#ratis-hdd-slowdown-vm]
>  code paths which fail in the face of disk {{IOException}}s were found. This 
> umbrella JIRA is to track the code paths found that need hardening. These 
> failures seem to be fatal to the Ratis server's ability to perform actions but 
> do not cause the server to abort.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
