[ 
https://issues.apache.org/jira/browse/KUDU-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964616#comment-16964616
 ] 

YangSong commented on KUDU-2975:
--------------------------------

{quote}So if the tablet is reincarnated on the same tserver in the same term, 
the tombstone will tell us who we voted for and we can't change our vote. That 
said, I also know that LEADER replicas persist a NOOP op to the WAL before 
replicating any additional ops, but I don't remember why we do that. Is it OK 
if that NOOP were to disappear between the two replicas' life times?
{quote}
 

This is my understanding; it may not be exact. The NOOP op serves two functions:

1. Tell followers about the leader's WAL progress, so that a follower whose WAL 
is behind can catch up with the leader as soon as possible. 

2. Give the new term a committed index. Raft requires us to reject config 
change operations until we have committed at least one operation in our current 
term as leader.

Even if there are no write ops on the leader tablet, the NOOP can be replicated 
on the next heartbeat. This gives followers a chance to catch up immediately, 
and a config change op can be allowed without waiting for the next write 
operation.
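The second function can be sketched in a few lines. This is a minimal, 
illustrative model of a Raft leader's term start-up, not Kudu's actual classes; 
all names here are hypothetical:

```python
# Minimal sketch of why a leader persists a NOOP at term start: committing
# it gives the new term a committed entry, which unblocks config changes.
# Entries are (term, op) pairs; index N is log position N (1-based).

class LeaderState:
    def __init__(self, term, log):
        self.term = term
        self.log = list(log)
        self.commit_index = 0

    def become_leader(self):
        # Persist a NOOP as the first op of the new term. Replicating and
        # committing it gives the term a committed index even with no writes.
        self.log.append((self.term, "NOOP"))

    def can_change_config(self):
        # Raft rule: reject config changes until at least one entry from
        # the current term is committed.
        return (self.commit_index > 0
                and self.log[self.commit_index - 1][0] == self.term)

    def advance_commit(self, majority_match_index):
        # Only entries from the current term are committed by counting
        # replicas; earlier entries then commit transitively.
        if (majority_match_index > self.commit_index
                and self.log[majority_match_index - 1][0] == self.term):
            self.commit_index = majority_match_index

leader = LeaderState(term=3, log=[(2, "WRITE")])
leader.become_leader()
print(leader.can_change_config())   # False: nothing from term 3 committed yet
leader.advance_commit(2)            # NOOP replicated to a majority
print(leader.can_change_config())   # True
```

Without the NOOP, `can_change_config()` would stay false until the next client 
write happened to commit in the new term.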

Is it OK if that NOOP were to disappear between the two replicas' lifetimes? I 
think it is. A similar question can be asked: is it safe for an election to 
occur during tablet recovery? Under the current recovery strategy, the leader 
replica detects that a follower has failed, reports its consensus metadata to 
the master, the master sends a config change to the leader replica (adding a 
new NON_VOTER replica), and the leader replica starts a change config op.
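That recovery sequence can be sketched roughly as follows. This is a runnable 
toy model under assumed names; the real flow lives in Kudu's C++ consensus and 
master code and looks nothing like this structurally:

```python
# Rough sketch of the re-replication flow: leader detects a failed
# follower, reports to the master, and the master drives a config
# change that adds a replacement as a NON_VOTER. All names illustrative.

class Leader:
    def __init__(self, config):
        self.config = dict(config)   # replica -> role ("VOTER"/"NON_VOTER")
        self.failed = set()

    def mark_follower_failed(self, replica):
        self.failed.add(replica)

    def consensus_metadata(self):
        return {"config": dict(self.config), "failed": set(self.failed)}

    def start_config_change(self, add, role):
        # In Kudu this is a replicated config-change op that must commit;
        # here we just apply it locally once "committed".
        self.config[add] = role

class Master:
    def __init__(self, spare_tservers):
        self.spares = list(spare_tservers)

    def select_replacement(self, report):
        # Pick a tserver that is not already in the reported config.
        return next(t for t in self.spares if t not in report["config"])

leader = Leader({"A": "VOTER", "B": "VOTER", "C": "VOTER"})
master = Master(["C", "D"])
leader.mark_follower_failed("C")
replacement = master.select_replacement(leader.consensus_metadata())
leader.start_config_change(add=replacement, role="NON_VOTER")
print(replacement)  # D
```

The new NON_VOTER later copies the tablet data and is promoted to VOTER before 
the failed replica is evicted.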

First, assume the failed follower replica never comes back up during recovery. 
If the leader steps down before the new config is committed, there are two 
scenarios: if the outgoing leader persisted the new config, the change config 
op continues after that leader restarts; if not, the change config op is lost 
and will be restarted by the master later. Once the new configuration is 
committed, the failed follower's log index will have fallen behind, so it can 
no longer become leader. The newly committed config will be reported to the 
master, which will do the rest of the work. Even if an election occurs later, 
the new leader is guaranteed to complete the recovery normally.

Next, assume the failed follower replica may rejoin at any point in time. If it 
rejoins immediately, it may start an election. If the leader is alive, the 
other followers will deny its vote request (which is what 
"withhold_votes_until_" guarantees). If the leader has stepped down, the 
rejoined follower may become leader, but that doesn't matter. Similarly, once 
the new configuration is persisted on the other replicas, the rejoined follower 
can no longer become leader. 
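The vote-withholding behavior amounts to a leader-lease check: a replica 
refuses to grant votes while it has heard from a live leader recently. A 
sketch, assuming made-up timing constants (only the `withhold_votes_until_` 
name mirrors Kudu's member variable):

```python
# Sketch of vote withholding: heartbeats push a "withhold" deadline
# forward; vote requests arriving before the deadline are denied, so a
# rejoined replica cannot disrupt a healthy leader.

HEARTBEAT_INTERVAL_S = 0.5
FAILURE_TIMEOUT_S = 3 * HEARTBEAT_INTERVAL_S   # illustrative value

class Replica:
    def __init__(self):
        self.withhold_votes_until_ = 0.0

    def on_leader_heartbeat(self, now):
        # Each heartbeat extends the withhold window.
        self.withhold_votes_until_ = now + FAILURE_TIMEOUT_S

    def handle_vote_request(self, now):
        # Deny the vote while the leader still looks alive.
        if now < self.withhold_votes_until_:
            return "DENY"
        return "GRANT"

r = Replica()
r.on_leader_heartbeat(now=10.0)
print(r.handle_vote_request(now=10.5))   # DENY: leader heard recently
print(r.handle_vote_request(now=12.0))   # GRANT: timeout elapsed
```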

Throughout the recovery process, the failed follower is deleted by the master 
and set to tombstoned, and before it is deleted, a new replica must already 
have been recovered (as a VOTER in the consensus config). Even if the two 
replicas have the same term, their log indexes must differ. So I don't think 
it's a problem.
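That last point rests on Raft's up-to-date check when granting votes: same 
term but a smaller last log index loses the comparison. A sketch, comparing 
(term, index) of the last log entry:

```python
# Raft's vote-granting log comparison: a voter grants a vote only if the
# candidate's log is at least as up-to-date as its own.

def candidate_log_ok(candidate_last, voter_last):
    """Each argument is a (term, index) pair for the last log entry."""
    c_term, c_index = candidate_last
    v_term, v_index = voter_last
    if c_term != v_term:
        return c_term > v_term
    return c_index >= v_index

# Same term, but the stale/reincarnated replica's log index is behind:
print(candidate_log_ok((5, 7), (5, 9)))   # False: vote denied
print(candidate_log_ok((5, 9), (5, 7)))   # True
```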

> Spread WAL across multiple data directories
> -------------------------------------------
>
>                 Key: KUDU-2975
>                 URL: https://issues.apache.org/jira/browse/KUDU-2975
>             Project: Kudu
>          Issue Type: New Feature
>          Components: fs, tablet, tserver
>            Reporter: LiFu He
>            Priority: Major
>         Attachments: network.png, tserver-WARNING.png, util.png
>
>
> Recently, we deployed a new kudu cluster and every node has 12 SSD. Then, we 
> created a big table and loaded data to it through flink.  We noticed that the 
> util of one SSD which is used to store WAL is 100% but others are free. So, 
> we suggest to spread WAL across multiple data directories.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
