[
https://issues.apache.org/jira/browse/IGNITE-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698938#comment-17698938
]
Ivan Bessonov edited comment on IGNITE-18475 at 3/10/23 1:21 PM:
-----------------------------------------------------------------
First of all, what are the implications of completely disabling fsync for the
log? (A sketch of the toggle itself is at the end of this comment.)
# If a minority of nodes have been restarted with a loss of the log suffix,
everything works fine. Nodes are treated according to their real state, and the
log is replicated once again.
The case is covered by {{{}ItTruncateSuffixAndRestartTest#testRestartSingleNode{}}}.
# If a majority of nodes have been restarted, but only a minority has lost the
log suffix, everything works fine.
The case is covered by {{{}ItTruncateSuffixAndRestartTest#testRestartTwoNodes{}}}.
This means that, in any situation, if only a minority of nodes lost the log
suffix, the raft group remains healthy and consistent.
# If a majority of nodes have been restarted, with the majority experiencing
the loss of the log suffix, things become unstable:
## If the leader has not been restarted, it may replicate the log suffix to the
followers that experienced data loss. If this happens, data will be consistent.
## If the leader has been restarted, a re-election will occur. Everything then
depends on its result.
### A node with the newest data is elected as the leader - everything is fine,
data will be consistent after replication.
### A node with data loss is elected as the leader. Two things may happen:
#### {-}If only a single RAFT log entry has been lost{-}, according to the new
leader, the group will move into a broken state. For example:
{code:java}
// Before start:
Node 0 (online)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ...,
data=0]
2: LogEntry [type=ENTRY_TYPE_DATA, id=LogId [index=2, term=1], ..., data=1]
Node 1 (offline)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ...,
data=0]
Node 2 (offline)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ...,
data=0]
// After start:
Node 0 (online)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ...,
data=0]
2: LogEntry [type=ENTRY_TYPE_DATA, id=LogId [index=2, term=1], ..., data=1]
Node 1 (online)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ...,
data=0]
2: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=2, term=3], ...,
data=1]
Node 2 (online)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ...,
data=0]
2: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=2, term=3], ...,
data=1]{code}
The log of node 0 is silently "corrupted": the data is inconsistent and so is
the configuration. {*}This is, most likely, a bug in JRaft{*}.
The following message can be seen in such a test for node 0, instead of an error:
{code:java}
WARNING: Received entries of which the lastLog=2 is not greater than
appliedIndex=2, return immediately with nothing changed. {code}
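To make the failure mode easier to follow, here is a minimal standalone sketch of the guard that the warning points at. This is not the actual JRaft code; the class and method names are made up for illustration, and only the condition "last received index is not greater than applied index" and the resulting no-op mirror the behavior observed above:
{code:java}
// Hypothetical illustration, not actual JRaft code: the follower refuses to touch
// its log when the incoming batch does not go past what it has already applied,
// so the conflicting local entry at index 2 is never truncated and re-replicated.
public class AppendEntriesGuardSketch {
    private final long appliedIndex = 2; // node 0 already applied up to index 2 (term 1)

    /** Returns true if the batch would be appended, false if it is silently ignored. */
    public boolean acceptEntries(long lastLogIndexOfBatch) {
        if (lastLogIndexOfBatch <= appliedIndex) {
            System.out.printf(
                "Received entries of which the lastLog=%d is not greater than appliedIndex=%d, "
                    + "return immediately with nothing changed.%n",
                lastLogIndexOfBatch, appliedIndex);
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // New leader (node 1) sends its entry with index 2, term 3;
        // node 0 keeps its own entry with index 2, term 1 and stays diverged.
        System.out.println(new AppendEntriesGuardSketch().acceptEntries(2));
    }
}
{code}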
#
##
###
#### 2. {-}If multiple log entries have been lost{-}, according to the new
leader, the aforementioned bug does not happen. The new majority, consisting of
old nodes, will continue working, while the old minority with "newer" data will
fail to replicate new updates. To my knowledge, no snapshot installation
attempts would take place.
Some data is permanently lost if not recovered manually, and some group nodes
require manual cleanup. Otherwise, data is consistent.
EDIT: _the real conditions are not known. The same behavior can be reproduced in
both cases._
4. A full cluster restart, where the majority of nodes lose the log suffix,
seems to be equivalent to case 3.2.2.
Jira can't handle code blocks inside of lists, sorry for the messed-up formatting.
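For reference, here is a minimal sketch of the toggle discussed above, assuming the upstream sofa-jraft options API (Ignite 3 ships a repackaged fork of JRaft, so the actual package names and wiring may differ):
{code:java}
import com.alipay.sofa.jraft.option.NodeOptions;
import com.alipay.sofa.jraft.option.RaftOptions;

// Minimal sketch, assuming the upstream sofa-jraft API: disabling per-entry fsync
// for the RAFT log. The scenarios above describe what can happen to the log suffix
// once this flag is set to false.
public class RaftSyncToggleSketch {
    public static NodeOptions nodeOptionsWithoutFsync() {
        RaftOptions raftOptions = new RaftOptions();
        raftOptions.setSync(false); // do not fsync each appended log entry; a crash may lose the log suffix

        NodeOptions nodeOptions = new NodeOptions();
        nodeOptions.setRaftOptions(raftOptions);
        return nodeOptions;
    }
}
{code}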
> Huge performance drop with enabled sync write per log entry for RAFT logs
> -------------------------------------------------------------------------
>
> Key: IGNITE-18475
> URL: https://issues.apache.org/jira/browse/IGNITE-18475
> Project: Ignite
> Issue Type: Task
> Reporter: Kirill Gusakov
> Assignee: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
> Time Spent: 10m
> Remaining Estimate: 0h
>
> During the YCSB benchmark runs for ignite-3 beta1 we found out that we have
> significant performance issues with select/insert queries.
> One of the root causes of these issues is that every log entry is written to
> rocksdb with the sync option enabled (which leads to frequent fsync calls).
> These issues can be reproduced by the localized JMH benchmarks
> [SelectBenchmark|https://github.com/gridgain/apache-ignite-3/blob/4b9de922caa4aef97a5e8e159d5db76a3fc7a3ad/modules/runner/src/test/java/org/apache/ignite/internal/benchmark/SelectBenchmark.java#L39]
> and
> [InsertBenchmark|https://github.com/gridgain/apache-ignite-3/blob/4b9de922caa4aef97a5e8e159d5db76a3fc7a3ad/modules/runner/src/test/java/org/apache/ignite/internal/benchmark/InsertBenchmark.java#L29]
> with RaftOptions.sync=true/false:
> * jdbc select queries: 115ms vs 4ms
> * jdbc insert queries: 70ms vs 2.5ms
> (These results were received on a MacBook Pro (16-inch, 2019), and it looks like
> macOS has a slow fsync in general, but runs on Ubuntu also show a huge
> difference (~26 times for the insert test). So your environment may show a
> different, but still huge, gap.)
> Why select queries suffer from syncs even more than inserts is described in
> https://issues.apache.org/jira/browse/IGNITE-18474.
> Possible solutions for the issue:
> * Don't sync every raft record in rocksdb by default, but this can break the
> raft invariants
> * Investigate the inner parts of RocksDB (according to syscall tracing, not
> every write with sync produces an fsync syscall); maybe other strategies will
> be suitable for our cases
--
This message was sent by Atlassian Jira
(v8.20.10#820010)