[
https://issues.apache.org/jira/browse/IGNITE-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698938#comment-17698938
]
Ivan Bessonov edited comment on IGNITE-18475 at 3/10/23 1:21 PM:
-----------------------------------------------------------------
First of all, what are the implications of completely disabling fsync for the
log? (A sketch of the toggle itself is at the end of this comment.)
# If a minority of nodes have been restarted with a loss of the log suffix,
everything works fine. Nodes are treated according to their real state, and the
log is replicated once again.
The case is covered by {{{}ItTruncateSuffixAndRestartTest#testRestartSingleNode{}}}.
# If a majority of nodes have been restarted, but only a minority has lost the
log suffix, everything works fine.
The case is covered by {{{}ItTruncateSuffixAndRestartTest#testRestartTwoNodes{}}}.
This means that, in any situation, if only a minority of nodes lost the log
suffix, the raft group remains healthy and consistent.
# If a majority of nodes have been restarted, with the majority experiencing
the loss of the log suffix, things become unstable:
## If the leader has not been restarted, it may replicate the log suffix to the
followers that experienced data loss. If this happens, data will be consistent.
## If the leader has been restarted, a re-election will occur. Everything then
depends on its result.
### A node with the newest data is elected as the leader - everything is fine,
data will be consistent after replication.
### A node with data loss is elected as the leader. Two things may happen:
#### {-}If only a single RAFT log entry has been lost{-}, according to the new
leader, the group will move into a broken state. For example:
{code:java}
// Before start:
Node 0 (online)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ...,
data=0]
2: LogEntry [type=ENTRY_TYPE_DATA, id=LogId [index=2, term=1], ..., data=1]
Node 1 (offline)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ...,
data=0]
Node 2 (offline)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ...,
data=0]
// After start:
Node 0 (online)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ...,
data=0]
2: LogEntry [type=ENTRY_TYPE_DATA, id=LogId [index=2, term=1], ..., data=1]
Node 1 (online)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ...,
data=0]
2: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=2, term=3], ...,
data=1]
Node 2 (online)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ...,
data=0]
2: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=2, term=3], ...,
data=1]{code}
The log of node 0 is silently "corrupted": the data is inconsistent and so is
the configuration. {*}This is, most likely, a bug in JRaft{*}.
The following message can be seen in such a test for node 0, instead of an error:
{code:java}
WARNING: Received entries of which the lastLog=2 is not greater than
appliedIndex=2, return immediately with nothing changed. {code}
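To make the failure mode easier to follow, here is a minimal standalone sketch of the guard that the warning points at. This is not the actual JRaft code; the class and method names are made up for illustration, and only the condition "last received index is not greater than applied index" and the resulting no-op mirror the behavior observed above:
{code:java}
// Hypothetical illustration, not actual JRaft code: the follower refuses to touch
// its log when the incoming batch does not go past what it has already applied,
// so the conflicting local entry at index 2 is never truncated and re-replicated.
public class AppendEntriesGuardSketch {
    private final long appliedIndex = 2; // node 0 already applied up to index 2 (term 1)

    /** Returns true if the batch would be appended, false if it is silently ignored. */
    public boolean acceptEntries(long lastLogIndexOfBatch) {
        if (lastLogIndexOfBatch <= appliedIndex) {
            System.out.printf(
                "Received entries of which the lastLog=%d is not greater than appliedIndex=%d, "
                    + "return immediately with nothing changed.%n",
                lastLogIndexOfBatch, appliedIndex);
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // New leader (node 1) sends its entry with index 2, term 3;
        // node 0 keeps its own entry with index 2, term 1 and stays diverged.
        System.out.println(new AppendEntriesGuardSketch().acceptEntries(2));
    }
}
{code}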
#
##
###
#### 2. {-}If multiple log entries have been lost{-}, according to the new
leader, the aforementioned bug does not happen. The new majority, consisting of
old nodes, will continue working, while the old minority with "newer" data will
fail to replicate new updates. To my knowledge, no snapshot installation
attempts would take place.
Some data is permanently lost if not recovered manually, and some group nodes
require manual cleanup. Otherwise, data is consistent.
EDIT: _the real conditions are not known. The same behavior can be reproduced in
both cases._
4. A full cluster restart, where the majority of nodes lose the log suffix,
seems to be equivalent to case 3.2.2.
Jira can't handle code blocks inside of lists, sorry for the messed-up formatting.
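For reference, here is a minimal sketch of the toggle discussed above, assuming the upstream sofa-jraft options API (Ignite 3 ships a repackaged fork of JRaft, so the actual package names and wiring may differ):
{code:java}
import com.alipay.sofa.jraft.option.NodeOptions;
import com.alipay.sofa.jraft.option.RaftOptions;

// Minimal sketch, assuming the upstream sofa-jraft API: disabling per-entry fsync
// for the RAFT log. The scenarios above describe what can happen to the log suffix
// once this flag is set to false.
public class RaftSyncToggleSketch {
    public static NodeOptions nodeOptionsWithoutFsync() {
        RaftOptions raftOptions = new RaftOptions();
        raftOptions.setSync(false); // do not fsync each appended log entry; a crash may lose the log suffix

        NodeOptions nodeOptions = new NodeOptions();
        nodeOptions.setRaftOptions(raftOptions);
        return nodeOptions;
    }
}
{code}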
> Huge performance drop with enabled sync write per log entry for RAFT logs
> -------------------------------------------------------------------------
>
> Key: IGNITE-18475
> URL: https://issues.apache.org/jira/browse/IGNITE-18475
> Project: Ignite
> Issue Type: Task
> Reporter: Kirill Gusakov
> Assignee: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
> Time Spent: 10m
> Remaining Estimate: 0h
>
> During the YCSB benchmark runs for ignite-3 beta1 we found out that we have
> significant performance issues with select/insert queries.
> One of the root causes of these issues is that every log entry is written to
> rocksdb with the sync option enabled (which leads to frequent fsync calls).
> These issues can be reproduced by the localized JMH benchmarks
> [SelectBenchmark|https://github.com/gridgain/apache-ignite-3/blob/4b9de922caa4aef97a5e8e159d5db76a3fc7a3ad/modules/runner/src/test/java/org/apache/ignite/internal/benchmark/SelectBenchmark.java#L39]
> and
> [InsertBenchmark|https://github.com/gridgain/apache-ignite-3/blob/4b9de922caa4aef97a5e8e159d5db76a3fc7a3ad/modules/runner/src/test/java/org/apache/ignite/internal/benchmark/InsertBenchmark.java#L29]
> with RaftOptions.sync=true/false:
> * jdbc select queries: 115ms vs 4ms
> * jdbc insert queries: 70ms vs 2.5ms
> (These results were received on a MacBook Pro (16-inch, 2019), and it looks like
> macOS has a slow fsync in general, but runs on Ubuntu also show a huge
> difference (~26 times for the insert test). So your environment may show a
> different, but still huge, gap.)
> Why select queries suffer from syncs even more than inserts is described in
> https://issues.apache.org/jira/browse/IGNITE-18474.
> Possible solutions for the issue:
> * Don't sync every raft record in rocksdb by default, but this can break the
> raft invariants
> * Investigate the inner parts of RocksDB (according to syscall tracing, not
> every write with sync produces an fsync syscall); maybe other strategies will
> be suitable for our cases
--
This message was sent by Atlassian Jira
(v8.20.10#820010)