[
https://issues.apache.org/jira/browse/KUDU-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dan Burkert reassigned KUDU-2472:
---------------------------------
Assignee: Dan Burkert
> master-stress-test flaky with failure to create table due to not enough
> tservers
> --------------------------------------------------------------------------------
>
> Key: KUDU-2472
> URL: https://issues.apache.org/jira/browse/KUDU-2472
> Project: Kudu
> Issue Type: Bug
> Reporter: Dan Burkert
> Assignee: Dan Burkert
> Priority: Major
>
> Currently {{master-stress-test}} fails in 5-7% of runs, during a CREATE
> TABLE operation:
> {code:java}
> F0611 20:58:01.335697 23508 master-stress-test.cc:217] Check failed: _s.ok()
> Bad status: Invalid argument: Error creating table
> default.table_6473953088b54f90af982172a0471cf6 on the master: not enough live
> tablet servers to create a table with the requested replication factor 3; 2
> tablet servers are alive{code}
> Due to the frequent master failovers introduced by the test, CREATE TABLE
> operations fail because the current leader master, which was likely just
> started and quickly elected, does not yet know of enough live tablet
> servers.
> In this case the master returns an InvalidArgument status to the client,
> which the client does not retry. This indicates a real issue that could
> occur in a production cluster if the leader master were restarted and
> quickly regained leadership. I'm not sure yet what the right fix is, but I
> can think of at least a few options:
> * Change the return status to ServiceUnavailable. The client will retry
> up to the timeout. The downside is that in legitimate scenarios where there
> aren't enough tablet servers, the operation will take the full timeout to
> fail, and will probably fail with a less useful error status type. Perhaps
> we could have a heuristic which says that if the leader hasn't been active
> for at least {{n * heartbeat_interval}} (where n is a small integer), then
> ServiceUnavailable is used.
> * Change master-stress-test to use tables with replication factor 1. This
> makes the race much less likely to occur, although it's still possible. This
> also doesn't fix the underlying issue.
> * Introduce a special case in the table-creating thread of
> master-stress-test to retry the specific {{InvalidArgument}} status. This
> also doesn't fix the underlying issue.
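
To make the heuristic in the first option concrete, here is a minimal sketch in Python rather than Kudu's C++. Everything in it is hypothetical and only for illustration: the {{LeaderState}} fields, the {{create_table_status}} function, and the default of n = 3 are assumptions, not existing Kudu code or flags. It shows one way the leader could pick a retriable status during a grace period of {{n * heartbeat_interval}} after election, and fall back to InvalidArgument afterwards.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "OK"
    INVALID_ARGUMENT = "InvalidArgument"
    SERVICE_UNAVAILABLE = "ServiceUnavailable"


@dataclass
class LeaderState:
    """Hypothetical snapshot of what the leader master knows (not a Kudu type)."""
    live_tservers: int           # tablet servers this leader currently considers alive
    seconds_as_leader: float     # time elapsed since this master became leader
    heartbeat_interval_s: float  # tablet server heartbeat period, in seconds


def create_table_status(state: LeaderState, replication_factor: int, n: int = 3) -> Status:
    """Choose the status for a CREATE TABLE request (illustrative only).

    n is the small integer multiplier from the proposal; 3 is an assumed default.
    """
    if state.live_tservers >= replication_factor:
        return Status.OK
    # A freshly elected leader may not have heard from every tablet server yet,
    # so report a retriable error while still inside the grace period.
    grace_period_s = n * state.heartbeat_interval_s
    if state.seconds_as_leader < grace_period_s:
        return Status.SERVICE_UNAVAILABLE
    # The leader has been active long enough that the shortage is likely real.
    return Status.INVALID_ARGUMENT
```

With this shape, a client that already retries ServiceUnavailable would ride out the window right after a failover, while a cluster that is genuinely short of tablet servers would still get InvalidArgument once the grace period has passed, instead of waiting for the full timeout.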
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)