[
https://issues.apache.org/jira/browse/KUDU-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961550#comment-16961550
]
Bankim Bhavsar commented on KUDU-2963:
--------------------------------------
{{AsyncCreateReplica}} is the class that sends CreateTablet requests and retries
them on failure.
The constructor of {{AsyncCreateReplica}} sets a default deadline of 30 seconds,
which is shorter than the default deadline of 1 hour set by the base class
constructor {{RetryingTSRpcTask}}.
https://github.com/apache/kudu/blob/master/src/kudu/master/catalog_manager.cc#L3390
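To illustrate the deadline relationship described above, here is a minimal sketch. It is not the Kudu code: {{RetryingTaskSketch}} and {{CreateReplicaTaskSketch}} are hypothetical stand-ins for {{RetryingTSRpcTask}} and {{AsyncCreateReplica}}, and it assumes the derived constructor simply overwrites the inherited deadline with start time plus the creation timeout.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for RetryingTSRpcTask: the base constructor picks a
// generous default deadline of one hour after the start time.
class RetryingTaskSketch {
 public:
  explicit RetryingTaskSketch(Clock::time_point start)
      : start_(start), deadline_(start + std::chrono::hours(1)) {}
  virtual ~RetryingTaskSketch() = default;

  Clock::time_point deadline() const { return deadline_; }

 protected:
  Clock::time_point start_;
  Clock::time_point deadline_;
};

// Hypothetical stand-in for AsyncCreateReplica: the derived constructor
// replaces the inherited deadline with a much shorter one. The timeout
// parameter plays the role of FLAGS_tablet_creation_timeout_ms (assumption).
class CreateReplicaTaskSketch : public RetryingTaskSketch {
 public:
  CreateReplicaTaskSketch(Clock::time_point start,
                          std::chrono::milliseconds creation_timeout)
      : RetryingTaskSketch(start) {
    assert(creation_timeout.count() > 0);
    deadline_ = start_ + creation_timeout;
  }
};
```

With a 30000 ms timeout, the derived task's deadline falls 30 seconds after the start time, well before the base class's one-hour default.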
{{FLAGS_tablet_creation_timeout_ms}} is used in two places:
- when setting the deadline for {{AsyncCreateReplica}}
- when issuing DeleteTablet RPCs after CreateTablet has failed to complete within
{{FLAGS_tablet_creation_timeout_ms}}
The deadline is honoured: no additional RPC requests are scheduled once the
deadline has passed.
https://github.com/apache/kudu/blob/master/src/kudu/master/catalog_manager.cc#L3276
https://github.com/apache/kudu/blob/master/src/kudu/master/catalog_manager.cc#L3287
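The deadline check can be pictured with the following sketch. It is a simplified model, not the catalog manager's implementation: {{RunWithDeadline}}, {{RetryResult}} and the fixed retry delay are hypothetical, and time is simulated with a millisecond counter so that the behaviour is deterministic.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

using Millis = std::chrono::milliseconds;

// Outcome of driving one retrying task to completion.
struct RetryResult {
  int attempts = 0;        // number of RPCs actually sent
  bool succeeded = false;  // true if an attempt succeeded before the deadline
};

// Hypothetical retry driver. `send_rpc` receives the 1-based attempt number
// and returns true on success. A new attempt is scheduled only while the
// simulated clock is still before the deadline.
inline RetryResult RunWithDeadline(Millis deadline, Millis retry_delay,
                                   const std::function<bool(int)>& send_rpc) {
  assert(retry_delay.count() > 0);  // a zero delay would never reach the deadline
  RetryResult result;
  Millis now{0};
  while (now < deadline) {
    ++result.attempts;
    if (send_rpc(result.attempts)) {
      result.succeeded = true;
      return result;
    }
    now += retry_delay;  // wait before the next attempt
  }
  return result;  // deadline passed: give up without scheduling more RPCs
}
```

For example, with a 30000 ms deadline, a 10000 ms retry delay and an RPC that always fails, the loop sends attempts at 0, 10000 and 20000 ms and then stops.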
I was able to verify this by tweaking
{{CreateTableITest::TestCreateWhenMajorityOfReplicasFailCreation}}:
the CreateTablet RPCs time out and no additional ones are issued to
the tablet servers.
Next, I can work on a unit test that proves that CreateTablet RPCs are not
retried indefinitely.
----
On another note, I noticed a bug in the
{{CreateTableITest::TestCreateWhenMajorityOfReplicasFailCreation}} test that
would have been the cause of the flakiness fixed by this
[commit|https://github.com/apache/kudu/commit/d119d529beb691a84134d02e33ecdce6102a7a35]
https://github.com/apache/kudu/blob/master/src/kudu/integration-tests/create-table-itest.cc#L133-L134
> Catalog manager never gives up on CreateTablet RPCs
> ---------------------------------------------------
>
> Key: KUDU-2963
> URL: https://issues.apache.org/jira/browse/KUDU-2963
> Project: Kudu
> Issue Type: Improvement
> Components: master
> Affects Versions: 1.11.0
> Reporter: Adar Dembo
> Assignee: Bankim Bhavsar
> Priority: Major
> Labels: newbie
>
> This is a problem when there aren't enough live tservers upon which to place
> a tablet's replicas, or when a chosen tserver doesn't create the replica
> quickly enough. If the catalog manager decides to replace the tablet, the
> replaced tablet's CreateTablet RPCs continue to retry ad infinitum. If the
> previously dead tservers then come back to life, they must needlessly process
> the CreateTablet RPCs.
> The tablets are eventually deleted, either through explicit DeleteTablet RPCs
> (triggered by the catalog manager replacement process), or by heartbeating,
> but it's an unnecessary drain on cluster resources.
> We should probably abort CreateTablet RPCs for tablets that have been removed
> from their table.
> CreateTableITest_TestCreateWhenMajorityOfReplicasFailCreation demonstrates
> this acutely.