[ 
https://issues.apache.org/jira/browse/KUDU-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961550#comment-16961550
 ] 

Bankim Bhavsar commented on KUDU-2963:
--------------------------------------

{{AsyncCreateReplica}} is the class that sends the CreateTablet request and 
retries it on failure.

The constructor of {{AsyncCreateReplica}} sets a default deadline of 30 seconds, 
which is shorter than the default deadline of 1 hour set by the constructor of 
its base class, {{RetryingTSRpcTask}}.
https://github.com/apache/kudu/blob/master/src/kudu/master/catalog_manager.cc#L3390

{{FLAGS_tablet_creation_timeout_ms}} is used in two places:
 - when setting the deadline for {{AsyncCreateReplica}}
 - when issuing DeleteTablet RPCs after CreateTablet has failed to complete 
within {{FLAGS_tablet_creation_timeout_ms}}

The deadline is honoured: once it has passed, no additional RPC requests are 
scheduled.
https://github.com/apache/kudu/blob/master/src/kudu/master/catalog_manager.cc#L3276
https://github.com/apache/kudu/blob/master/src/kudu/master/catalog_manager.cc#L3287
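
To illustrate the behaviour described above, here is a minimal, self-contained 
sketch. It is not the Kudu code: the class names {{RetryingTask}} and 
{{CreateReplicaTask}}, the simulated clock, and the fixed backoff are all 
assumptions made for illustration. It only shows the general pattern of a 
derived task shortening the base-class deadline and the retry loop refusing to 
schedule further attempts once that deadline has passed.

```cpp
#include <chrono>
#include <functional>

using Millis = std::chrono::milliseconds;

// Hypothetical stand-in for a retrying RPC task base class (not Kudu's real one).
class RetryingTask {
 public:
  // The base class picks a generous default deadline of 1 hour from `now`.
  explicit RetryingTask(Millis now) : deadline_(now + std::chrono::hours(1)) {}
  virtual ~RetryingTask() = default;

  // Sends attempts until one succeeds or the deadline has passed.
  // `send` returns true on success. Time is simulated: each failed attempt
  // advances the clock by `backoff`. Returns the number of attempts sent.
  int Run(Millis now, Millis backoff, const std::function<bool()>& send) {
    int attempts = 0;
    while (now < deadline_) {  // no new attempt is scheduled past the deadline
      ++attempts;
      if (send()) {
        succeeded_ = true;
        break;
      }
      now += backoff;
    }
    return attempts;
  }

  bool succeeded() const { return succeeded_; }

 protected:
  void set_deadline(Millis deadline) { deadline_ = deadline; }

 private:
  Millis deadline_;
  bool succeeded_ = false;
};

// Hypothetical derived task: replaces the 1 hour default with a shorter,
// configurable creation timeout (30 seconds in the scenario discussed here).
class CreateReplicaTask : public RetryingTask {
 public:
  CreateReplicaTask(Millis now, Millis creation_timeout) : RetryingTask(now) {
    set_deadline(now + creation_timeout);
  }
};
```

With a 30 second timeout and a 10 second backoff, a request that always fails 
is attempted at t = 0, 10 and 20 seconds and then abandoned, rather than being 
retried for the full hour allowed by the base class.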

I was able to verify this by tweaking 
{{CreateTableITest::TestCreateWhenMajorityOfReplicasFailCreation}}: the 
CreateTablet RPCs time out and no additional ones are issued to the tablet 
servers.

Next, I can work on a unit test that proves that CreateTablet RPCs are not 
retried indefinitely.

----

On another note, I noticed a bug in the 
{{CreateTableITest::TestCreateWhenMajorityOfReplicasFailCreation}} test that 
would have been the cause of the flakiness fixed by this 
[commit|https://github.com/apache/kudu/commit/d119d529beb691a84134d02e33ecdce6102a7a35]:

https://github.com/apache/kudu/blob/master/src/kudu/integration-tests/create-table-itest.cc#L133-L134

> Catalog manager never gives up on CreateTablet RPCs
> ---------------------------------------------------
>
>                 Key: KUDU-2963
>                 URL: https://issues.apache.org/jira/browse/KUDU-2963
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 1.11.0
>            Reporter: Adar Dembo
>            Assignee: Bankim Bhavsar
>            Priority: Major
>              Labels: newbie
>
> This is a problem when there aren't enough live tservers upon which to place 
> a tablet's replicas, or when a chosen tserver doesn't create the replica 
> quickly enough. If the catalog manager decides to replace the tablet, the 
> replaced tablet's CreateTablet RPCs continue to retry ad infinitum. If the 
> previously dead tservers then come back to life, they must needlessly process 
> the CreateTablet RPCs.
> The tablets are eventually deleted, either through explicit DeleteTablet RPCs 
> (triggered by the catalog manager replacement process), or by heartbeating, 
> but it's an unnecessary drain on cluster resources.
> We should probably abort CreateTablet RPCs for tablets that have been removed 
> from their table.
> CreateTableITest_TestCreateWhenMajorityOfReplicasFailCreation demonstrates 
> this acutely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
