[ 
https://issues.apache.org/jira/browse/ACCUMULO-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366804#comment-15366804
 ] 

Josh Elser commented on ACCUMULO-4359:
--------------------------------------

bq. I'm not sure about the best way to fix it, advice is welcome, but I'm 
thinking that a binary exponential backoff (maybe capped at 30s?) instead of a 
retry every 100ms would at least lighten the load on the tservers?

Yes, I completely agree with you on the timeout/retry logic.
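
For illustration only, a capped binary exponential backoff along the lines
proposed above could look roughly like the sketch below. The class name
CappedBackoff, the method delayMs, and the 100ms/30s values are assumptions
made for this example; none of this is existing Accumulo code.

```java
/**
 * Hypothetical helper (not part of Accumulo): computes retry delays that
 * double on every attempt and never exceed a fixed cap.
 */
public final class CappedBackoff {
    private final long initialMs;
    private final long capMs;

    public CappedBackoff(long initialMs, long capMs) {
        if (initialMs <= 0 || capMs < initialMs) {
            throw new IllegalArgumentException("need 0 < initialMs <= capMs");
        }
        this.initialMs = initialMs;
        this.capMs = capMs;
    }

    /** Delay before retry number {@code attempt} (0-based): initialMs * 2^attempt, capped at capMs. */
    public long delayMs(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        long delay = initialMs;
        for (int i = 0; i < attempt; i++) {
            // Stop doubling once the next step would exceed the cap; this also avoids overflow.
            if (delay > capMs / 2) {
                return capMs;
            }
            delay *= 2;
        }
        return delay;
    }

    public static void main(String[] args) {
        // Assumed values: start at the current 100ms retry interval, cap at 30s.
        CappedBackoff backoff = new CappedBackoff(100, 30_000);
        for (int attempt = 0; attempt < 12; attempt++) {
            System.out.println("attempt " + attempt + ": wait " + backoff.delayMs(attempt) + "ms");
        }
    }
}
```

A retry loop would sleep for delayMs(attempt) between attempts rather than a
fixed 100ms. Adding random jitter to each delay is a common refinement, left
out here to keep the sketch short.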

It's also very difficult to separate RPC-level exceptions into those that are 
retryable and those that are fatal.

Both of these would be great places for improvement.
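
As a rough sketch of the second point, one possible approach is to walk the
cause chain of a failure and treat authentication errors as fatal. Everything
here is an assumption made for illustration: the class RpcFailureClassifier
is hypothetical, and which exception types actually surface from the
Thrift/SASL layers in Accumulo would need to be checked against the real code.

```java
import java.security.GeneralSecurityException;
import javax.security.sasl.SaslException;

/**
 * Hypothetical classifier (not part of Accumulo): decides whether a failed
 * RPC is worth retrying by inspecting the exception's cause chain.
 */
public final class RpcFailureClassifier {
    public enum Kind { RETRYABLE, FATAL }

    // Bound the walk so a cyclic cause chain cannot loop forever.
    private static final int MAX_DEPTH = 32;

    public static Kind classify(Throwable failure) {
        int depth = 0;
        for (Throwable c = failure; c != null && depth < MAX_DEPTH; c = c.getCause(), depth++) {
            // Assumption: authentication problems (e.g. an expired Kerberos
            // ticket) will not fix themselves, so retrying is pointless.
            if (c instanceof SaslException || c instanceof GeneralSecurityException) {
                return Kind.FATAL;
            }
        }
        // Assumption: anything else (timeouts, refused connections) may be transient.
        return Kind.RETRYABLE;
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException(new SaslException("GSS initiate failed"));
        System.out.println(classify(wrapped));
        System.out.println(classify(new java.net.SocketTimeoutException("read timed out")));
    }
}
```

A retry loop could then give up immediately on FATAL and back off and retry
only on RETRYABLE.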

bq. I know there were some issues with older Hadoop versions... perhaps you 
need to update to 2.6.4 or later?

Nah, he's saying that they didn't launch a renewal thread for their Kerberos 
ticket. So, after their MapReduce job ran, they had an invalid ticket (it would 
be expected that they couldn't make an RPC). Instead of failing when this 
happened, we sat in a loop, retrying rapidly on every failure.

> Accumulo client stuck in infinite loop when Kerberos ticket expires
> -------------------------------------------------------------------
>
>                 Key: ACCUMULO-4359
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4359
>             Project: Accumulo
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.7.2
>         Environment: Problem only exists when Kerberos is turned on.
>            Reporter: Russ Weeks
>            Assignee: Russ Weeks
>            Priority: Minor
>             Fix For: 1.8.0
>
>
> If an Accumulo client tries to send an RPC to a tserver but the client's 
> token is expired, it will get stuck in an infinite loop 
> [here|https://github.com/apache/accumulo/blob/1.7/core/src/main/java/org/apache/accumulo/core/client/impl/ServerClient.java#L102].
> I'm setting the priority to "minor" because it's actually pretty difficult to 
> put the system into this state: you have to create the client with a valid 
> token, let the token expire, and then try to use the client. We hit this by 
> accident in the cleanup phase of a very long-running MR job; the workaround 
> (a.k.a. the right way to do it) is to create a new client instead of re-using 
> an old client.
> On the tserver side, we get an exception like this every 100ms:
> {noformat}
> java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Peer indicated failure: GSS initiate failed
>       at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
>       at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:51)
>       at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:48)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:360)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
>       at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory.getTransport(UGIAssumingTransportFactory.java:48)
>       at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:208)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>       at java.lang.Thread.run(Thread.java:745)
> {noformat}
> On the client side, no output is produced unless debug logging is turned on 
> for o.a.a.core.client.impl.ServerClient, in which case you see a bunch of 
> "Failed to find TGT" errors.
> I'm not sure about the best way to fix it, advice is welcome, but I'm 
> thinking that a binary exponential backoff (maybe capped at 30s?) instead of 
> a retry every 100ms would at least lighten the load on the tservers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
