[jira] [Commented] (HADOOP-13604) Abort retry loop when RPC has an unrecoverable Auth error

2016-09-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506865#comment-15506865
 ] 

Steve Loughran commented on HADOOP-13604:
-

Thank you for volunteering: created HADOOP-13627 for you.

Bear in mind we are all scared of the code and of changes breaking things; keep 
the diffs minimal, and don't change the message text we have today. Not 
because it is good, but because it is searchable in existing JIRAs and 
Stack Overflow topics.

> Abort retry loop when RPC has an unrecoverable Auth error
> -
>
> Key: HADOOP-13604
> URL: https://issues.apache.org/jira/browse/HADOOP-13604
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Henry Robinson
> Assignee: Xiao Chen
>
> I've seen an issue where, after an RPC client hit an error obtaining a TGT 
> from Kerberos, the RPC client continues to retry even though there's no 
> chance of success (the no login window is set to 600s).
> In this particular deployment, the client retries 15 times at 15s intervals, 
> leading to a delay of more than three minutes before the failure is bubbled 
> up to the client when the RPC ultimately fails.
> Unrecoverable errors (like failures to log in to Kerberos) should lead to fast 
> aborts of the retry loop.
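
The behaviour requested in the issue can be illustrated with a minimal sketch of a fixed-interval retry loop that aborts as soon as it sees an unrecoverable failure. This is not Hadoop's RPC client code: `UnrecoverableAuthException`, `FailFastRetry` and `worstCaseDelaySeconds` are hypothetical names made up for the example, and the only figures taken from the report are the 15 retries at 15s intervals.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical marker for failures that retrying cannot fix (not a Hadoop class). */
class UnrecoverableAuthException extends IOException {
    UnrecoverableAuthException(String message) {
        super(message);
    }
}

/** Hypothetical fixed-interval retry loop; an illustration, not Hadoop's implementation. */
class FailFastRetry {
    /** Number of attempts made by the most recent call, kept for illustration. */
    int attempts;

    <T> T call(Callable<T> op, int maxRetries, long sleepMillis) throws Exception {
        attempts = 0;
        while (true) {
            attempts++;
            try {
                return op.call();
            } catch (UnrecoverableAuthException e) {
                // No chance of success: rethrow at once instead of sleeping and retrying.
                throw e;
            } catch (IOException e) {
                // Possibly transient: retry until the retry budget is spent.
                if (attempts > maxRetries) {
                    throw e;
                }
                Thread.sleep(sleepMillis);
            }
        }
    }

    /** Worst-case delay before the failure surfaces when every retry is used. */
    static long worstCaseDelaySeconds(int retries, long intervalSeconds) {
        return retries * intervalSeconds;
    }
}
```

With the numbers from the report, `FailFastRetry.worstCaseDelaySeconds(15, 15)` gives 225 seconds (3 minutes 45 seconds), which matches the "more than three minutes" delay; with the fail-fast branch, an auth failure surfaces after a single attempt.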



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13604) Abort retry loop when RPC has an unrecoverable Auth error

2016-09-19 Thread Xiao Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505134#comment-15505134
 ] 

Xiao Chen commented on HADOOP-13604:


Thanks [~ste...@apache.org] for the pointer! I linked another TWMNBN-related 
issue, HADOOP-13590, to HADOOP-12649 as well. :)

I agree it would be difficult and suboptimal to filter auth errors out of the 
current IOEs. Is there a specific JIRA for fixing this?







[jira] [Commented] (HADOOP-13604) Abort retry loop when RPC has an unrecoverable Auth error

2016-09-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503842#comment-15503842
 ] 

Steve Loughran commented on HADOOP-13604:
-

Renamed to make clear it's about auth. The problem is twofold: determining 
that there's an unrecoverable error, and then rejecting it. The retry handler is 
meant to retry on some things (ConnectionRefused, etc.), but Kerberos auth 
problems are unlikely to go away. That said, some network problems (especially 
DNS) also tend not to resolve on their own.

W.r.t. auth problems, the fact that UGI tends to generate IOEs rather than 
anything you can filter on is going to have to be fixed first.
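
To make the "nothing you can filter on" point concrete, here is a small sketch of a classifier that decides between retrying and failing based on exception type. It is an illustration under assumptions, not Hadoop's retry handler: `AuthFailureException` and `RetryClassifier` are hypothetical names, while `java.net.ConnectException` and `java.net.UnknownHostException` are the standard JDK types for a refused connection and a failed DNS lookup.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.UnknownHostException;

/** Hypothetical typed auth failure; the name is an assumption, not from the thread. */
class AuthFailureException extends IOException {
    AuthFailureException(String message) {
        super(message);
    }
}

/** Hypothetical classifier: decides whether a failure is worth retrying. */
class RetryClassifier {
    enum Decision { RETRY, FAIL }

    static Decision classify(IOException e) {
        if (e instanceof AuthFailureException) {
            // Auth problems are unlikely to go away on their own.
            return Decision.FAIL;
        }
        if (e instanceof UnknownHostException) {
            // DNS failures also tend not to resolve on their own.
            return Decision.FAIL;
        }
        if (e instanceof ConnectException) {
            // Connection refused: the server may simply not be up yet.
            return Decision.RETRY;
        }
        // A plain IOException carries no type information, so an auth failure
        // reported this way is indistinguishable from a transient error.
        return Decision.RETRY;
    }
}
```

In this sketch, `classify(new IOException("Login failure"))` still returns `RETRY`: without a dedicated exception type, the only alternative is matching on message text, which is brittle.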



