[
https://issues.apache.org/jira/browse/HDFS-11804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018923#comment-16018923
]
Rushabh S Shah edited comment on HDFS-11804 at 5/21/17 6:09 PM:
----------------------------------------------------------------
This is by no means a committable patch.
It needs some rework.
Changes done in this patch:
# Only {{LoadBalancingKMSClientProvider}} is instantiated when
{{createProvider}} is called.
# Added three new config keys, similar to those of the dfs client.
#* {{kms.client.failover.attempts.multiplier}}: The multiplier factor.
For example, if {{kms.client.failover.attempts.multiplier}} is 2 and there are
2 backend KMS servers, the client will retry at most 4 times.
The default is 1, to maintain the original behavior.
#* {{kms.client.failover.sleep.base.millis}}: Used to calculate the failover
sleep time.
The default is 500 (same as the dfs client configuration).
#* {{kms.client.failover.sleep.max.millis}}: Used to calculate the maximum
amount of time to sleep.
The default is 15000 (same as the dfs client configuration).
# The client is configured to use {{FailoverOnNetworkExceptionRetry}} with
{{TryOnceThenFail}} as the fallback policy.
On encountering {{AccessControlException}} or {{AuthorizationException}}, the
client will stop retrying. This assumes all the servers are configured with
identical permissions and key ACLs.
# Since {{RetryPolicy#shouldRetry}} needs to be told whether the operation is
idempotent, we need to decide which operations are idempotent.
Below is my understanding.
#* IdempotentOperations: renewDelegationToken, cancelDelegationToken,
decryptEncryptedKey, getKeyVersion, getKeys, getKeysMetadata, getKeyVersions,
getCurrentKey, getMetadata
#* AtMostOnce: createKey, deleteKey, rollNewVersion
#* Not sure: addDelegationTokens, generateEncryptedKey,
reencryptEncryptedKey
I would like to know your views.
There are a few TODOs in the patch. I would like to know your opinions on those as well.
Thanks in advance for reviewing my patch.
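
To make the proposal above easier to discuss, here is a minimal, self-contained sketch of the retry budget, the sleep calculation, and the retry decision. It is not code from the patch and does not use the Hadoop classes. Everything in it is an assumption made for illustration: the class {{KmsRetrySketch}}, the {{AccessDenied}} stand-in for {{AccessControlException}}/{{AuthorizationException}}, the capped exponential backoff without jitter, and the conservative choice to retry only operations classified as idempotent when it is unclear whether the server processed the request.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.Map;

// Illustrative sketch only; all names here are hypothetical, not from the patch.
final class KmsRetrySketch {
    /** Classification from the comment; UNKNOWN is the "Not sure" group. */
    enum Kind { IDEMPOTENT, AT_MOST_ONCE, UNKNOWN }

    // A subset of the operations listed in the comment.
    static final Map<String, Kind> KINDS = Map.of(
        "decryptEncryptedKey", Kind.IDEMPOTENT,
        "getCurrentKey", Kind.IDEMPOTENT,
        "createKey", Kind.AT_MOST_ONCE,
        "rollNewVersion", Kind.AT_MOST_ONCE,
        "generateEncryptedKey", Kind.UNKNOWN);

    /** Stand-in for AccessControlException / AuthorizationException. */
    static class AccessDenied extends IOException {
        AccessDenied(String msg) { super(msg); }
    }

    /** Attempt budget: multiplier times the number of backend KMS servers. */
    static int maxAttempts(int multiplier, int providerCount) {
        return multiplier * providerCount;
    }

    /**
     * Assumed backoff: base * 2^failovers, capped at maxMillis, with no sleep
     * before the first failover. A real policy may also add random jitter.
     */
    static long sleepMillis(int failovers, long baseMillis, long maxMillis) {
        if (failovers == 0) {
            return 0L;
        }
        int shift = Math.min(failovers, 20); // keep the shift small to avoid overflow
        return Math.min(baseMillis << shift, maxMillis);
    }

    /** Decide whether to fail over to the next KMS server and retry. */
    static boolean shouldRetry(Exception e, String op, int attempts, int maxAttempts) {
        if (attempts >= maxAttempts) {
            return false; // retry budget exhausted
        }
        if (e instanceof AccessDenied) {
            return false; // assumes every server has identical permissions and ACLs
        }
        if (e instanceof ConnectException) {
            return true; // the request never reached this server
        }
        if (e instanceof IOException) {
            // The server may have processed the request, so (conservatively)
            // retry only operations classified as idempotent.
            return KINDS.getOrDefault(op, Kind.UNKNOWN) == Kind.IDEMPOTENT;
        }
        return false; // anything else: try once, then fail
    }

    public static void main(String[] args) {
        System.out.println("max attempts: " + maxAttempts(2, 2));
        for (int failovers = 0; failovers <= 5; failovers++) {
            System.out.println(failovers + " failovers -> sleep "
                + sleepMillis(failovers, 500L, 15000L) + " ms");
        }
    }
}
```

Whether the AtMostOnce and "Not sure" operations should also be retried after an ambiguous failure is exactly the open question above; the sketch simply picks the most conservative answer.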
> KMS client needs retry logic
> ----------------------------
>
> Key: HDFS-11804
> URL: https://issues.apache.org/jira/browse/HDFS-11804
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Rushabh S Shah
> Assignee: Rushabh S Shah
> Attachments: HDFS-11804-trunk.patch
>
>
> The KMS client appears to have no retry logic at all. It's completely
> decoupled from the ipc retry logic. This has major impacts if the KMS is
> unreachable for any reason, including but not limited to network connection
> issues, timeouts, or a +restart during an upgrade+.
> This has some major ramifications:
> # Jobs may fail to submit, although oozie resubmit logic should mask it
> # Non-oozie launchers may experience higher failure rates if they do not
> already have retry logic.
> # Tasks reading EZ files will fail, though this will probably be masked by
> framework reattempts
> # EZ file creation fails after creating a 0-length file – client receives
> EDEK in the create response, then fails when decrypting the EDEK
> # Bulk hadoop fs copies, and maybe distcp, will prematurely fail