[ https://issues.apache.org/jira/browse/KUDU-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016507#comment-17016507 ]
ASF subversion and git services commented on KUDU-3011:
-------------------------------------------------------
Commit 54db215511e84785a8649ba1e52911f8adfb11e4 in kudu's branch
refs/heads/master from Andrew Wong
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=54db215 ]
KUDU-3011 p5: transfer leadership when quiescing
This amends the behavior of quiescing such that when a tablet server is
quiescing, it will transfer leadership to a caught-up follower as soon
as it can.
While in this state, unlike while in a graceful stepdown period, the
tablet can still be written to, so as not to obstruct ongoing workloads.
Tests are added to exercise:
- The basic behavior: even without injecting any errors that might cause
  elections, a quiescing leader will relinquish leadership.
- The behavior when there are followers being caught up. In such cases,
  the leader won't immediately relinquish leadership -- instead, it will
  wait for the followers to catch up before stepping down.
- The behavior when being written to. The fact that a leader is
  quiescing shouldn't affect its ability to be written to.
- The behavior of the PeerMessageQueue when responding to various peer
  responses.
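As an illustration only, the behaviors listed above can be modeled with a
small Python sketch. It is not Kudu's implementation (which is C++), and
every name in it (LeaderState, pick_successor, write, follower_match_index)
is hypothetical. It assumes that "caught up" means a follower has
acknowledged the leader's last log index.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LeaderState:
    """Hypothetical, simplified view of a Raft leader's bookkeeping."""
    quiescing: bool
    last_log_index: int
    # Follower UUID -> index of the last log entry that follower acknowledged.
    follower_match_index: Dict[str, int]


def pick_successor(state: LeaderState) -> Optional[str]:
    """Return a caught-up follower to hand leadership to, or None to keep leading."""
    if not state.quiescing:
        # A leader that is not quiescing has no reason to step down here.
        return None
    # Iterate in sorted order so the choice is deterministic in this sketch.
    for uuid in sorted(state.follower_match_index):
        if state.follower_match_index[uuid] >= state.last_log_index:
            return uuid
    # Nobody is caught up yet: keep leading and re-check on later responses.
    return None


def write(state: LeaderState) -> int:
    """Append one entry; quiescing does not reject writes in this model."""
    state.last_log_index += 1
    return state.last_log_index
```

In this model, pick_successor would be re-evaluated each time a follower
response updates follower_match_index, which is one way to read "as soon
as it can" in the description above.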
I also removed some election-causing injection from a couple of existing
tests that was previously required to transfer leadership while
quiescing.
Note: right now, if all tablet servers are quiescing while a write
workload is ongoing, the leaders will send a large number of StartElection
requests to the followers. A follow-up patch will address this.
Change-Id: Idbf0716f5c9455f83ff5f6f601b0f5042f77d078
Reviewed-on: http://gerrit.cloudera.org:8080/15012
Reviewed-by: Adar Dembo
Reviewed-by: Alexey Serbin
Tested-by: Andrew Wong
> Support for smooth maintenance window
> -------------------------------------
>
> Key: KUDU-3011
> URL: https://issues.apache.org/jira/browse/KUDU-3011
> Project: Kudu
> Issue Type: New Feature
> Reporter: LiFu He
> Assignee: Andrew Wong
> Priority: Major
>
> A scan that hits a tablet failure causes the entire SQL query to fail on
> common query engines, such as Impala. Though we have the fault-tolerant
> feature via "SetFaultTolerant()", Impala doesn't use it right now since it
> lowers throughput. Thus, many running SQL queries will fail when we
> shut down/reboot/upgrade the tserver. That can be scary.
> Maybe we can make some improvements in this area. For example, tablets
> could be disallowed from being scanned once the tserver is in maintenance
> mode (KUDU-2069). And for LEADER_ONLY mode scanning, the leader role needs
> to be shifted away from the tserver under maintenance. Then we can shut
> down the tserver smoothly after all the existing SQL queries have completed.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)