[
https://issues.apache.org/jira/browse/HADOOP-11957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556757#comment-14556757
]
Colin Patrick McCabe commented on HADOOP-11957:
-----------------------------------------------
bq. if we got a socket error, then waiting for the count to go to zero will put
you on an infinite loop. What would you like to do ? Fail fast or wait forever ?
It isn't an infinite loop. The loop terminates when nobody else is using the
socket (the reference count goes to 0). If the reference count is not going to
0 even though there are no other users, there is a bug somewhere else where we
are incorrectly forgetting to decrement the reference count.
bq. we are in a bad situation when this happens, in the specific case I was
seeing ENOTCONN. You would be absolutely right, if the error was due to calling
socket-Shutdown in java code in DomainSocket.java, but this call it is to OS
layer and if the OS says it is an invalid socket, no one else can use it
either. That is having a reference count is not going to help anyway.
The reference count isn't to protect the socket, it's to protect the socket
number. As long as the reference count is nonzero, it is not safe to close the
socket, since someone might be in the process of making a call to that socket
number. Calling close will allow that socket number to be reused by the
operating system.
bq. On Mac - This failure happens in a specific test , where OS returns socket
is not connected error and we get stuck, we will eventually come out since the
test will time out. - This is the one that I was fixing, In fact I will create
a simple wrapper around shutdown0(so I can use mockito to inject errors) and
then call into the same call as you have proposed. But would like to break out
if we get an error like ENOTCONN - Socket not connected. At that point no one
can be using that socket anyway. From looking at the man pages -
http://man7.org/linux/man-pages/man2/shutdown.2.html - I get a feeling that if
a socket shutdown has failed, that socket will not be in use by anyone. They
would be getting errors while communicating is my feeling. Do you agree ? if
so, then breaking out seems to be the right thing to do.
If {{DomainSocket#connect}} fails, we should never get a DomainSocket object in
the first place.
{code}
public static DomainSocket connect(String path) throws IOException {
if (loadingFailureReason != null) {
throw new UnsupportedOperationException(loadingFailureReason);
}
int fd = connect0(path);
return new DomainSocket(path, fd);
}
{code}
I don't understand how you are having problems with an unconnected socket in
{{DomainSocket#close}}, since you can't even create a {{DomainSocket}} object
until {{connect0}} succeeds. It sounds like we need to dig deeper into this
issue, which may be a reference count problem somewhere. In any case, we most
certainly should not modify {{DomainSocket#close}}.
> if an IOException error is thrown in DomainSocket.close we go into infinite
> loop.
> ---------------------------------------------------------------------------------
>
> Key: HADOOP-11957
> URL: https://issues.apache.org/jira/browse/HADOOP-11957
> Project: Hadoop Common
> Issue Type: Bug
> Components: net
> Affects Versions: 2.7.1
> Reporter: Anu Engineer
> Assignee: Anu Engineer
> Attachments: HADOOP-11957.001.patch
>
>
> if an IOException error is thrown in DomainSocket.close we go into infinite
> loop.
> Issue : If the shutdown0(fd) call throws an IOException we break out of the
> if shutdown call but will continue to loop in the while loop infinitely since
> we have no way of decrementing the counter. Please scroll down and see the
> comment marked with Bug Bug to see where the issue is.
> {code:title=DomainSocket.java}
> @Override
> public void close() throws IOException {
> // Set the closed bit on this DomainSocket
> int count = 0;
> try {
> count = refCount.setClosed();
> } catch (ClosedChannelException e) {
> // Someone else already closed the DomainSocket.
> return;
> }
> // Wait for all references to go away
> boolean didShutdown = false;
> boolean interrupted = false;
> while (count > 0) {
> if (!didShutdown) {
> try {
> // Calling shutdown on the socket will interrupt blocking system
> // calls like accept, write, and read that are going on in a
> // different thread.
> shutdown0(fd);
> } catch (IOException e) {
> LOG.error("shutdown error: ", e);
> }
> didShutdown = true;
> // *BUG BUG* <-- Here the code will never exit the loop
> // if the count is greater then 0. we need to break out
> // of the while loop in case of IOException Error
> }
> try {
> Thread.sleep(10);
> } catch (InterruptedException e) {
> interrupted = true;
> }
> count = refCount.getReferenceCount();
> }
> // At this point, nobody has a reference to the file descriptor,
> // and nobody will be able to get one in the future either.
> // We now call close(2) on the file descriptor.
> // After this point, the file descriptor number will be reused by
> // something else. Although this DomainSocket object continues to hold
> // the old file descriptor number (it's a final field), we never use it
> // again because this DomainSocket is closed.
> close0(fd);
> if (interrupted) {
> Thread.currentThread().interrupt();
> }
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)