[
https://issues.apache.org/jira/browse/KUDU-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964616#comment-16964616
]
YangSong commented on KUDU-2975:
--------------------------------
{quote}So if the tablet is reincarnated on the same tserver in the same term,
the tombstone will tell us who we voted for and we can't change our vote. That
said, I also know that LEADER replicas persist a NOOP op to the WAL before
replicating any additional ops, but I don't remember why we do that. Is it OK
if that NOOP were to disappear between the two replicas' life times?
{quote}
This is my understanding; it may not be exactly right. The NOOP op serves two
functions:
1. It tells followers how far the leader's WAL has progressed, so a follower
whose WAL is behind can catch up with the leader as soon as possible.
2. It gives the new term a committed index. Raft requires us to reject config
change operations until we have committed at least one operation in our
current term as leader.
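The second point can be sketched as a tiny state machine. This is a minimal
illustration of the Raft rule, not Kudu's actual code; the names `Leader`,
`on_elected`, and `can_change_config` are all hypothetical:

```python
class Leader:
    """Toy model of a newly elected Raft leader (illustrative only)."""

    def __init__(self, term):
        self.term = term            # term in which this replica became leader
        self.committed_term = 0     # term of the most recently committed op

    def on_elected(self):
        # On winning an election, replicate a NOOP so the new term gains a
        # committed index as soon as a majority acknowledges it.
        self.replicate_noop()

    def replicate_noop(self):
        # Assume a majority acknowledges; the NOOP commits in the current term.
        self.committed_term = self.term

    def can_change_config(self):
        # Raft: reject config changes until an op from the current term commits.
        return self.committed_term == self.term
```

Until the NOOP commits, `can_change_config()` stays false, which is why
replicating it eagerly matters for recovery latency.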
Even if there are no write ops on the leader tablet, the NOOP can be replicated
during the next heartbeat. The followers therefore have a chance to catch up
immediately, and a config change op can be accepted right away instead of
waiting for the next write operation.
Is it OK if that NOOP were to disappear between the two replicas' lifetimes? I
think it is. The same question can be asked as: is it safe for an election to
occur during tablet recovery? Under the current recovery strategy, the leader
replica detects that a follower has failed and reports the consensus metadata
to the master; the master sends a config change (adding a new NON_VOTER
replica) to the leader replica; and the leader replica starts a config change
op.
First, assume the failed follower replica never comes back up during recovery.
If the leader steps down before the new config is committed, there are two
scenarios: if the downed leader persisted the new config, the config change op
continues after the downed leader restarts; if not, the config change op is
lost and the master restarts it later. Once the new configuration is
committed, the failed follower's log index falls behind, so it can no longer
become leader. The newly committed config is reported to the master, which
does the rest of the work. Even if an election occurs later, the new leader is
guaranteed to complete the recovery normally.
Then assume the failed follower replica may rejoin at any point in time. If it
rejoins immediately, it may start an election. If the leader is alive, the
other followers will deny its vote (the "withhold_votes_until_" mechanism
guarantees this). If the leader has stepped down, the rejoined follower may
become leader, but that doesn't matter. Similarly, once the new configuration
is persisted on another replica, the rejoined follower can no longer become
leader.
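The vote-withholding behavior can be sketched as follows. This is only an
illustration of the idea behind "withhold_votes_until_", not Kudu's actual
implementation; `Replica`, `ELECTION_TIMEOUT`, and the timestamps are all
assumed names and values:

```python
ELECTION_TIMEOUT = 1.5  # seconds; assumed value for illustration

class Replica:
    """Toy voter that withholds votes while it believes a leader is alive."""

    def __init__(self):
        self.withhold_votes_until = 0.0

    def on_leader_heartbeat(self, now):
        # Hearing from a valid leader pushes the withholding deadline forward.
        self.withhold_votes_until = now + ELECTION_TIMEOUT

    def grant_vote(self, now):
        # Deny votes until a full election timeout has passed without a
        # heartbeat, so a flapping replica cannot disrupt a healthy leader.
        return now >= self.withhold_votes_until
```

This is why the rejoined follower's election attempt fails while the leader is
still heartbeating the other replicas.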
Throughout the recovery process, the failed follower is deleted by the master
and marked tombstoned, and before it is deleted a new replica must have been
recovered (as a VOTER in the consensus config). Even if the two replicas have
the same term, their log indexes must differ. So I don't think this is a
problem.
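That last point, same term but a lagging log index, is handled by Raft's
log up-to-date check: a voter grants a vote only if the candidate's
(last log term, last log index) pair is at least as large as its own. A
minimal sketch with illustrative names only:

```python
def candidate_log_ok(candidate_last, voter_last):
    """Raft up-to-date check: each argument is a (term, index) pair for the
    last log entry. Python compares tuples lexicographically, i.e. term
    first, then index, which matches Raft's rule."""
    return candidate_last >= voter_last
```

So a tombstoned replica whose last entry is (term 7, index 10) cannot collect
a vote from a recovered replica whose last entry is (term 7, index 42).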
> Spread WAL across multiple data directories
> -------------------------------------------
>
> Key: KUDU-2975
> URL: https://issues.apache.org/jira/browse/KUDU-2975
> Project: Kudu
> Issue Type: New Feature
> Components: fs, tablet, tserver
> Reporter: LiFu He
> Priority: Major
> Attachments: network.png, tserver-WARNING.png, util.png
>
>
> Recently, we deployed a new Kudu cluster where every node has 12 SSDs. Then
> we created a big table and loaded data into it through Flink. We noticed that
> the utilization of the single SSD used to store the WAL was 100% while the
> others were idle. So we suggest spreading the WAL across multiple data
> directories.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)