[jira] [Comment Edited] (HADOOP-14521) KMS client needs retry logic
[ https://issues.apache.org/jira/browse/HADOOP-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16055963#comment-16055963 ]

Xiao Chen edited comment on HADOOP-14521 at 6/20/17 3:43 PM:

bq. pre-commit
I tried running {{TestAclsEndToEnd}} before and after applying this patch on branch-2.8. It passed before and failed after. Could you try the same? Perhaps something odd is happening in my local environment. (I didn't look inside; I only ran the tests.)

bq. I don't fully understand your comment.
Your patch is cool; the tool was just confused by the naming. {{smart-apply-patch}} uses the wiki's naming convention to choose the patch for trunk, branch-2, branch-2.8, etc. I was saying that running {{dev-support/bin/smart-apply-patch HADOOP-14521}} on trunk tried to apply the branch-2.8 patch:
{quote}
$ ./dev-support/bin/smart-apply-patch HADOOP-14521
Processing: HADOOP-14521
HADOOP-14521 patch is being downloaded at Tue Jun 20 08:31:28 PDT 2017 from
https://issues.apache.org/jira/secure/attachment/12873350/HDFS-11804-branch-2.8.patch -> Downloaded
ERROR: Aborting! HADOOP-14521 cannot be verified.
{quote}
Maybe we can improve yetus on that since, as you said, pre-commit ran fine. This is NBD for this jira since, as I said, I can download and apply the patch manually.

> KMS client needs retry logic
>
>
>                 Key: HADOOP-14521
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14521
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 2.6.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>         Attachments: HADOOP-14521.09.patch, HDFS-11804-branch-2.8.patch,
> HDFS-11804-trunk-1.patch, HDFS-11804-trunk-2.patch, HDFS-11804-trunk-3.patch,
> HDFS-11804-trunk-4.patch, HDFS-11804-trunk-5.patch, HDFS-11804-trunk-6.patch,
> HDFS-11804-trunk-7.patch, HDFS-11804-trunk-8.patch, HDFS-11804-trunk.patch
>
>
> The kms client appears to have no retry logic – at all. It's completely
> decoupled from the ipc retry logic. This has major impacts if the KMS is
> unreachable for any reason, including but not limited to network connection
> issues, timeouts, or a +restart during an upgrade+.
> This has some major ramifications:
> # Jobs may fail to submit, although oozie resubmit logic should mask it
> # Non-oozie launchers may experience higher failure rates if they do not
> already have retry logic.
> # Tasks reading EZ files will fail, though this will probably be masked by
> framework reattempts
> # EZ file creation fails after creating a 0-length file – client receives
> EDEK in the create response, then fails when decrypting the EDEK
> # Bulk hadoop fs copies, and maybe distcp, will prematurely fail

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HADOOP-14521) KMS client needs retry logic
[ https://issues.apache.org/jira/browse/HADOOP-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048566#comment-16048566 ]

Rushabh S Shah edited comment on HADOOP-14521 at 6/14/17 1:08 AM:

bq. We can either update the message, or update the maxNumRetries to providers.length to make sure we try at least once on all providers. Either way feels okay to me.
I wouldn't worry much about the log message not being entirely accurate. Users who set the number of retries lower than the number of providers know the problems associated with that.
{quote}
Slightly prefer the latter since by assumption all providers should work and we don't have retry delay for the whole sweep.
{quote}
Actually, I'd prefer not to override a config value that the user intentionally set. If you feel strongly about updating the log message, I can change it in the next revision and add the numRetries and the providers length. Let me know your thoughts.

[~xiaochen]: As always, thanks a lot for your time reviewing all the updated patches.
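To make the trade-off in this comment concrete, here is a minimal sketch of a fixed-budget failover loop. It is not the code from any of the attached patches: the class {{FailoverRetrySketch}}, the {{ProviderOp}} interface and the {{doOp}} signature are hypothetical, providers are represented as plain strings, and {{maxNumRetries}} is assumed to count total attempts. The point it illustrates is that the configured value is honored as-is, so a budget smaller than the provider list leaves some providers untried, and the failure message reports both numbers.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical sketch; names and signatures are not taken from the patch. */
final class FailoverRetrySketch {

  /** Hypothetical stand-in for one operation executed against one provider. */
  interface ProviderOp<T> {
    T call(String provider) throws IOException;
  }

  /**
   * Tries providers round-robin starting at startIndex. The configured
   * maxNumRetries is never overridden: with fewer attempts than providers,
   * some providers are simply never tried.
   */
  static <T> T doOp(List<String> providers, int startIndex,
      int maxNumRetries, ProviderOp<T> op) throws IOException {
    if (providers.isEmpty() || startIndex < 0 || maxNumRetries < 1) {
      throw new IllegalArgumentException(
          "need at least one provider, startIndex >= 0 and maxNumRetries >= 1");
    }
    IOException last = null;
    for (int attempt = 0; attempt < maxNumRetries; attempt++) {
      // Fail over to the next provider on every attempt, wrapping around.
      String provider = providers.get((startIndex + attempt) % providers.size());
      try {
        return op.call(provider);
      } catch (IOException e) {
        last = e; // remember the most recent failure and move on
      }
    }
    // Report both values so the message stays accurate for any configuration.
    throw new IOException("Aborting after " + maxNumRetries
        + " attempt(s) across " + providers.size() + " provider(s)", last);
  }
}
```

With three providers and {{maxNumRetries}} set to 2, the third provider is never contacted, which is the situation the log message discussion is about.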
[jira] [Comment Edited] (HADOOP-14521) KMS client needs retry logic
[ https://issues.apache.org/jira/browse/HADOOP-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047355#comment-16047355 ]

Rushabh S Shah edited comment on HADOOP-14521 at 6/13/17 2:52 AM:

bq. For the retries, do you think maybe it's more intuitive to specify num of failovers for the whole provider set?
From [this comment, point 1|https://issues.apache.org/jira/browse/HADOOP-14521?focusedCommentId=16036988&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16036988], I think [~daryn] gave a good reason why having a fixed number of retries is a good idea. I agree with him, so I am keeping the fixed number of retries even in patch #7.

bq. KMSCP: the first @Link seems redundant
Fixed in v7.

bq. Good to have some validations on the config params (e.g. numRetries > 0 etc.)
Added a few precondition checks for all the config params.

bq. I'd revert the change that breaks instead of returning.
Thanks a lot for catching that. Fixed in v7 of the patch.
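As an illustration of the kind of precondition checks mentioned above, here is a minimal sketch in plain Java. Only the {{numRetries > 0}} check comes from the review comment; the class {{RetryConfigSketch}} and the two sleep parameters are hypothetical and are included only to show the pattern of failing fast on invalid configuration. The actual patch may validate different parameters and may use a different helper.

```java
/** Hypothetical sketch of fail-fast validation for retry-related settings. */
final class RetryConfigSketch {
  final int numRetries;
  final int sleepBaseMillis; // hypothetical parameter
  final int sleepMaxMillis;  // hypothetical parameter

  RetryConfigSketch(int numRetries, int sleepBaseMillis, int sleepMaxMillis) {
    // The check named in the review: numRetries > 0.
    check(numRetries > 0, "numRetries must be > 0, got " + numRetries);
    // Assumed additional checks, shown only to illustrate the pattern.
    check(sleepBaseMillis >= 0,
        "sleepBaseMillis must be >= 0, got " + sleepBaseMillis);
    check(sleepMaxMillis >= sleepBaseMillis,
        "sleepMaxMillis must be >= sleepBaseMillis, got " + sleepMaxMillis);
    this.numRetries = numRetries;
    this.sleepBaseMillis = sleepBaseMillis;
    this.sleepMaxMillis = sleepMaxMillis;
  }

  /** Throws IllegalArgumentException with the given message if ok is false. */
  private static void check(boolean ok, String message) {
    if (!ok) {
      throw new IllegalArgumentException(message);
    }
  }
}
```

Validating in the constructor means a bad value is reported when the client is created rather than on the first failed request.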