[ 
https://issues.apache.org/jira/browse/HDFS-11804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043108#comment-16043108
 ] 

Daryn Sharp commented on HDFS-11804:
------------------------------------

General issues:
* The configured attempts are not actually attempts, but retries, so the client 
makes attempts + 1 calls.  See the test issues below; I only noticed this due 
to the test.
* Suggest incrementing {{numFailovers}} in the for loop statement, instead of a 
conditional in the middle of the loop, to be a little more clear about what's 
happening.
* The ACE catch block should rethrow instead of "break"-ing, since the latter 
causes it to log a misleading line that it tried all the providers when it 
actually didn't.
* If {{shouldRetry}} were to throw an IOE, it would erroneously be re-wrapped in 
another IOE.
* The sleep condition is {{numFailovers >= providers.length}}. I think it makes 
more sense as {{(numFailovers % providers.length) == 0}} to sleep only between 
sweeps of the kms cluster.
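
To make the suggestions above concrete, here is a minimal sketch of the loop shape I have in mind. It is not the patch's code: {{FailoverSketch}}, {{ProviderOp}} and {{AccessDeniedException}} are hypothetical stand-ins, the sleep is replaced by a counter so the behavior is observable, and the {{numFailovers > 0}} guard is my own assumption to avoid pausing before the very first call. The retry policy ({{shouldRetry}}) is not modeled.

```java
import java.io.IOException;

final class FailoverSketch {
    /** Hypothetical stand-in for one call against the provider at the given index. */
    interface ProviderOp<T> {
        T call(int providerIndex) throws IOException;
    }

    /** Hypothetical stand-in for a non-retriable access control failure (ACE). */
    static class AccessDeniedException extends IOException {
        AccessDeniedException(String msg) { super(msg); }
    }

    /** Counts back-off pauses; real code would sleep instead. */
    static int sleeps = 0;

    static <T> T doOp(int numProviders, int maxAttempts, ProviderOp<T> op)
            throws IOException {
        IOException last = null;
        // numFailovers is incremented in the for statement, so the counting
        // is visible in one place and maxAttempts means total calls.
        for (int numFailovers = 0; numFailovers < maxAttempts; numFailovers++) {
            // Pause only between full sweeps of the provider list.
            // The "> 0" guard is an assumption of this sketch.
            if (numFailovers > 0 && numFailovers % numProviders == 0) {
                sleeps++;
            }
            try {
                return op.call(numFailovers % numProviders);
            } catch (AccessDeniedException ace) {
                // Rethrow rather than break, so the "tried all providers"
                // message below is never reached for an access failure.
                throw ace;
            } catch (IOException ioe) {
                last = ioe; // remember the failure and fail over
            }
        }
        throw new IOException("Gave up after " + maxAttempts
            + " attempts across " + numProviders + " providers", last);
    }
}
```

With 3 providers and 7 attempts that all fail with a plain IOE, this makes exactly 7 calls and pauses twice, once before each new sweep. An access failure from any provider ends the loop immediately.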

Test issues:
* {{testClientRetriesWithAccessControlException}} doesn't appear to test what 
it claims to test.  The 1st provider throws IOE, 2nd throws ACE.  The test 
doesn't verify that the ACE stopped the retries.  It tests that the IOE did.
* {{LoadBalancingKMSClientProvider.KMS_FAILOVER_MAX_ATTEMPTS_KEY}} is really 
acting like max retries. {{testClientRetriesSpecifiedNumberOfTimes}} shows that 
configuring 10 attempts does not result in 10 calls.
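
As an illustration of the attempts vs. retries distinction only (this is not the patch's loop, and {{AttemptCountSketch}} and its method names are hypothetical), the two interpretations of the same limit differ by one call when every call fails:

```java
final class AttemptCountSketch {
    /** Treats the limit as max retries: the initial call is not counted. */
    static int callsWithLimitAsRetries(int limit) {
        int calls = 0;
        int retries = 0;
        while (true) {
            calls++;                 // every call fails in this sketch
            if (retries >= limit) {
                break;               // retry budget exhausted
            }
            retries++;
        }
        return calls;
    }

    /** Treats the limit as max attempts: every call is counted. */
    static int callsWithLimitAsAttempts(int limit) {
        int calls = 0;
        for (int attempt = 0; attempt < limit; attempt++) {
            calls++;                 // every call fails in this sketch
        }
        return calls;
    }
}
```

With a limit of 10, the first method makes 11 calls and the second makes 10. A test that asserts on the number of provider invocations would catch the difference.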

> KMS client needs retry logic
> ----------------------------
>
>                 Key: HDFS-11804
>                 URL: https://issues.apache.org/jira/browse/HDFS-11804
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>         Attachments: HDFS-11804-trunk-1.patch, HDFS-11804-trunk-2.patch, 
> HDFS-11804-trunk-3.patch, HDFS-11804-trunk.patch
>
>
> The kms client appears to have no retry logic – at all.  It's completely 
> decoupled from the ipc retry logic.  This has major impacts if the KMS is 
> unreachable for any reason, including but not limited to network connection 
> issues, timeouts, and the +restart during an upgrade+.
> This has some major ramifications:
> # Jobs may fail to submit, although oozie resubmit logic should mask it
> # Non-oozie launchers may experience higher failure rates if they do not 
> already have retry logic.
> # Tasks reading EZ files will fail, though this will probably be masked by 
> framework reattempts
> # EZ file creation fails after creating a 0-length file – client receives 
> EDEK in the create response, then fails when decrypting the EDEK
> # Bulk hadoop fs copies, and maybe distcp, will prematurely fail


