[ 
https://issues.apache.org/jira/browse/HADOOP-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047382#comment-16047382
 ] 

Xiao Chen commented on HADOOP-14521:
------------------------------------

Thanks for sticking with this, Rushabh.

bq. From this comment, point 1,
Right... sorry I missed that one.

What still confuses me is this: what if {{maxNumRetries < providers.length}}? If 
someone configures it to be 2 and we have 3 providers, the call will fail and 
print this message:
{code}
    LOG.warn("Aborting since the Request has failed with all KMS"
              + " providers in the group or the exception is not recoverable.");
{code}
which isn't entirely accurate, because here it did not fail with all providers.

We can either update the message, or raise {{maxNumRetries}} to at least 
{{providers.length}} so that every provider is tried at least once. Either way 
feels okay to me. I slightly prefer the latter, since by assumption all providers 
should work and we don't have a retry delay for the whole sweep.
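
To make the second option concrete, here is a rough sketch. It is not taken from the patch. It assumes {{maxNumRetries}} counts total attempts and that providers are tried round-robin; the class name, method names, and the boolean "provider is up" array are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; not the actual KMS client code. */
final class RetrySweepSketch {

    /**
     * Assumption: the configured value counts total attempts.
     * Raising it to the provider count guarantees one full sweep.
     */
    static int effectiveMaxAttempts(int configuredMaxRetries, int numProviders) {
        return Math.max(configuredMaxRetries, numProviders);
    }

    /**
     * Returns the provider indices that would be tried, in order,
     * stopping at the first provider that is up or when attempts run out.
     */
    static List<Integer> attemptOrder(boolean[] providerUp, int configuredMaxRetries) {
        List<Integer> tried = new ArrayList<>();
        if (providerUp.length == 0) {
            return tried; // nothing to try
        }
        int attempts = effectiveMaxAttempts(configuredMaxRetries, providerUp.length);
        for (int i = 0; i < attempts; i++) {
            int idx = i % providerUp.length; // round-robin over providers
            tried.add(idx);
            if (providerUp[idx]) {
                break; // request succeeded on this provider
            }
        }
        return tried;
    }
}
```

With 3 providers and a configured value of 2, the sketch still reaches the third provider, so the "failed with all KMS providers" message would only be logged after every provider has actually been tried.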

Besides that, I'm +1 on patch 7.

> KMS client needs retry logic
> ----------------------------
>
>                 Key: HADOOP-14521
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14521
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 2.6.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>         Attachments: HDFS-11804-trunk-1.patch, HDFS-11804-trunk-2.patch, 
> HDFS-11804-trunk-3.patch, HDFS-11804-trunk-4.patch, HDFS-11804-trunk-5.patch, 
> HDFS-11804-trunk-6.patch, HDFS-11804-trunk-7.patch, HDFS-11804-trunk.patch
>
>
> The kms client appears to have no retry logic – at all.  It's completely 
> decoupled from the ipc retry logic.  This has major impacts if the KMS is 
> unreachable for any reason, including but not limited to network connection 
> issues, timeouts, the +restart during an upgrade+.
> This has some major ramifications:
> # Jobs may fail to submit, although oozie resubmit logic should mask it
> # Non-oozie launchers may experience higher failure rates if they do not 
> already have retry logic.
> # Tasks reading EZ files will fail, though this will probably be masked by 
> framework reattempts
> # EZ file creation fails after creating a 0-length file – client receives 
> EDEK in the create response, then fails when decrypting the EDEK
> # Bulk hadoop fs copies, and maybe distcp, will prematurely fail



