[ 
https://issues.apache.org/jira/browse/HDFS-11804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh S Shah updated HDFS-11804:
----------------------------------
    Attachment: HDFS-11804-trunk-2.patch

Attaching a new patch addressing all the comments by [~daryn] and [~xiaochen].

bq. Should we also treat AuthenticationException as non-retry?
Since {{AuthenticationException}} is not thrown by 
{{LoadBalancingKMSClientProvider#doOp}}, we can't catch it explicitly.
It will instead surface wrapped in a {{WrapperException}}, which we don't retry.
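
To illustrate the point, here is a minimal standalone sketch, not the patch code. {{WrapSketch}}, {{AuthFailure}} and {{Wrapper}} are hypothetical stand-ins (for the provider, the checked {{AuthenticationException}}, and {{WrapperException}} respectively), and the exact catch structure is an assumption. It only shows why a checked non-IO exception leaves the loop on the first attempt while an {{IOException}} moves on to the next provider.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Illustrative only; all names here are hypothetical. */
final class WrapSketch {
  /** Stand-in for a checked, non-IO failure such as an authentication error. */
  static class AuthFailure extends Exception {
    AuthFailure(String msg) { super(msg); }
  }

  /** Stand-in for an unchecked wrapper around such a failure. */
  static class Wrapper extends RuntimeException {
    Wrapper(Throwable cause) { super(cause); }
  }

  /** Tries the operation once per provider; only IOExceptions move on to the next one. */
  static <T> T doOp(int numProviders, Callable<T> op) throws IOException {
    IOException last = null;
    for (int i = 0; i < numProviders; i++) {
      try {
        return op.call();
      } catch (IOException e) {
        last = e;              // the only path that tries another provider
      } catch (RuntimeException e) {
        throw e;               // unchecked: propagate unchanged
      } catch (Exception e) {
        throw new Wrapper(e);  // checked non-IO: wrap and leave the loop at once
      }
    }
    throw last != null ? last : new IOException("no providers configured");
  }
}
```

A caller that needs the original exception can unwrap it with {{getCause()}}.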

bq. I think we should retry immediately for all 3 at first. (Same as current 
LBKMSCP behavior).
Addressed in the latest patch.

bq. Should just specify a max retries that defaults to the number of load 
balanced hosts (today's behavior) is probably better.
Addressed in the latest patch.

bq. The max sleep time seems rather high since it can block server threads for 
a long time. I'd suggest perhaps 1-2s.
Made it 2 seconds.

bq. Should probably only sleep after trying all clients instead of between each 
client.
Addressed in the latest patch.

bq. Separate try blocks for shouldRetry and sleep to better control exception 
rethrows.
Addressed in the latest patch.

bq. Change the sleep catch to rethrow InterruptedIOException.
Addressed in the latest patch.

bq. Trivial but there's a lot of double punctuation, ex. periods, exclamations.
Addressed in the latest patch.

One more major change in the latest patch: all operations are now treated as 
non-idempotent.
Since {{FailoverOnNetworkExceptionRetry#shouldRetry}} catches all the 
socket-related exceptions in its first catch block and fails over, it should be 
safe to assume that any exception that gets past that first catch block 
shouldn't be retried.
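
As a rough sketch of the behaviour described in the replies above (not the code in the patch), the loop below fails over immediately between providers, sleeps only after a full pass over all of them, caps the sleep at 2 seconds, and rethrows an interrupted sleep as {{InterruptedIOException}}. {{RetrySketch}}, {{Op}}, {{isNetworkFailure}} and the parameter names are hypothetical. {{maxAttempts}} counts total attempts, which is one assumed reading of "max retries", and the exponential backoff is likewise an assumption.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.util.Arrays;
import java.util.List;

/** Illustrative only; all names here are hypothetical. */
final class RetrySketch {
  /** Stand-in for one KMS call against one provider. */
  interface Op<T> {
    T call(String provider) throws IOException;
  }

  static final long MAX_SLEEP_MS = 2000; // the 2 second cap from the review

  /** Only network-level failures trigger failover in this sketch. */
  static boolean isNetworkFailure(IOException e) {
    return e instanceof ConnectException
        || e instanceof SocketException
        || e instanceof SocketTimeoutException;
  }

  static <T> T doOp(List<String> providers, int maxAttempts, long baseSleepMs,
      Op<T> op) throws IOException {
    if (providers.isEmpty()) {
      throw new IllegalArgumentException("at least one provider is required");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      String provider = providers.get((attempt - 1) % providers.size());
      try {
        return op.call(provider);
      } catch (IOException e) {
        if (!isNetworkFailure(e)) {
          throw e; // treated as non-idempotent: never retried
        }
        last = e;  // network failure: fail over to the next provider
      }
      // Separate try block for the sleep, entered only after a full pass.
      boolean fullPass = attempt % providers.size() == 0;
      if (fullPass && attempt < maxAttempts) {
        int pass = attempt / providers.size();
        long sleepMs = Math.min(MAX_SLEEP_MS, baseSleepMs << Math.min(pass - 1, 20));
        try {
          Thread.sleep(sleepMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new InterruptedIOException("Interrupted while waiting to retry");
        }
      }
    }
    throw last != null ? last : new IOException("no attempts were made");
  }

  public static void main(String[] args) throws IOException {
    List<String> hosts = Arrays.asList("kms1", "kms2"); // made-up provider names
    String result = doOp(hosts, hosts.size(), 100, p -> {
      if (p.equals("kms1")) {
        throw new ConnectException(p + " unreachable");
      }
      return "handled by " + p;
    });
    System.out.println(result);
  }
}
```

Passing {{hosts.size()}} as {{maxAttempts}} mirrors the default mentioned above: each provider is tried once with no sleep in between.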

> KMS client needs retry logic
> ----------------------------
>
>                 Key: HDFS-11804
>                 URL: https://issues.apache.org/jira/browse/HDFS-11804
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>         Attachments: HDFS-11804-trunk-1.patch, HDFS-11804-trunk-2.patch, 
> HDFS-11804-trunk.patch
>
>
> The kms client appears to have no retry logic – at all.  It's completely 
> decoupled from the ipc retry logic.  This has major impacts if the KMS is 
> unreachable for any reason, including but not limited to network connection 
> issues, timeouts, the +restart during an upgrade+.
> This has some major ramifications:
> # Jobs may fail to submit, although oozie resubmit logic should mask it
> # Non-oozie launchers may experience higher failure rates if they do not 
> already have retry logic.
> # Tasks reading EZ files will fail, though this will probably be masked by 
> framework reattempts
> # EZ file creation fails after creating a 0-length file – client receives 
> EDEK in the create response, then fails when decrypting the EDEK
> # Bulk hadoop fs copies, and maybe distcp, will prematurely fail


