haiyang1987 commented on PR #6266: URL: https://github.com/apache/hadoop/pull/6266#issuecomment-1818874038
> `InterruptedIOException` is a subclass of `IOException`. The try-catch block in `getActiveNodeProxy` catches `IOException`, so an `InterruptedIOException` will be caught there.
>
> https://docs.oracle.com/javase/8/docs/api/java/io/InterruptedIOException.html
>
> What I think should be happening is the following.
>
> ```
> main_Thread calling `triggerActiveLogRoll`, wait for 60 secs, timeout, cancel this task, and return.
>
> MultipleNameNodeProxy.call() thread:
> -> getActiveNodeProxy()
> -> nnLookup.next = ob2 (down node)
> -> RPC.waitForProxy(ob2)
> -> after 60 secs, interrupted.
> -> output "Failed to reach ob2", increment nnLoopCount. Ideally, we should just stop here, since we already timed out.
> -> nnLookup.next = n1 (live node).
> then, it should succeed in connecting to n1.
> ```
>
> This does not seem to be the case from the logs you shared. Thoughts?
>
> A possible fix might be to explicitly catch `InterruptedIOException` in `getActiveNodeProxy` and just finish this thread (assuming all `InterruptedIOException`s originate from `triggerActiveLogRoll`). For the following `triggerActiveLogRoll` calls, we should be good, since we will move `nnLookup` to the next one.

Yeah, explicitly catching `InterruptedIOException` in `MultipleNameNodeProxy#call()` and then exiting is also a solution.
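
To make the idea concrete, here is a minimal sketch of that catch ordering. It is not the actual `MultipleNameNodeProxy` code: `FailoverSketch`, `Connector`, and `firstReachable` are hypothetical names, and `Connector.connect` only stands in for a blocking call such as `RPC.waitForProxy`. It assumes the interrupt means the caller has already timed out, so the loop should stop instead of moving on to the next NameNode.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.List;

/** Hypothetical illustration only; not the Hadoop implementation. */
final class FailoverSketch {

  /** Hypothetical stand-in for a blocking connect such as RPC.waitForProxy. */
  interface Connector {
    String connect(String node) throws IOException;
  }

  /**
   * Tries each node in order and returns the first successful connection.
   * Returns null if every node failed or if the thread was interrupted.
   */
  static String firstReachable(List<String> nodes, Connector connector) {
    for (String node : nodes) {
      try {
        return connector.connect(node);
      } catch (InterruptedIOException e) {
        // This clause must come before the IOException clause, because
        // InterruptedIOException is a subclass of IOException.
        // Assumption: the caller already timed out and cancelled this task,
        // so restore the interrupt flag and stop instead of failing over.
        Thread.currentThread().interrupt();
        return null;
      } catch (IOException e) {
        // Ordinary failure (node down): fall through and try the next node.
      }
    }
    return null;
  }
}
```

With nodes `ob2` (interrupted) and `n1` (live), this returns without ever contacting `n1`. With a plain `IOException` from `ob2`, it still fails over to `n1` as before.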