[
https://issues.apache.org/jira/browse/HADOOP-11957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551062#comment-14551062
]
Anu Engineer commented on HADOOP-11957:
---------------------------------------
bq. Thanks for looking at this, Anu Engineer. The patch is not correct because
right now it unconditionally breaks out of the loop without waiting for the
socket reference count to drop to 0. This is very dangerous because it means
that we may call close on a file descriptor while another thread is still using
it. If this happens, a third thread may open another file, which will get the
same file descriptor number as the one which was just closed. And then the
second thread is doing some arbitrary operation on the wrong file descriptor.
if we got a socket error, then waiting for the count to go to zero will put you
on an infinite loop. What would you like to do ? Fail fast or wait forever ?
I was seeing calls to shutdown0(native) would fail, in this specific case
shutdown0 failed with socket not connected error. if you look at the change
that I proposed(the test failed) -- was to try to call shutdown with the
counter getting decremented, instead of the current call where we call into
shutdown0. I have not investigated why the specific test case failed.
bq. To be honest, I do not see any way of handling this better than we
currently do. If shutdown fails, we simply have to block until the socket is no
longer in use.
if the shutdown fails, we should bail since current call is to the OS native
call layer, then the assumption that failure happened due to someone else is
using socket - AFAIK is not correct. We should be able to shutdown a socket
even if someone is using it , that user will get an error is a different
problem.
bq. There is no possible shortcut that I can see.
we are in a bad situation when this happens, in the specific case I was seeing
ENOTCONN. You would be absolutely right, if the error was due to calling
socket-Shutdown in java code in DomainSocket.java, but this call it is to OS
layer and if the OS says it is an invalid socket, no one else can use it
either. That is having a reference count is not going to help anyway.
bq. Did you encounter this problem in production? If so, what was the shutdown
error?
Production - No, I don't own any clusters unfortunately. During bug bash I was
trying to shepherd one of the bugs from [~cnauroth] , where he had fixed an
issue to get the native code compiling on mac HDFS-3296. But did not get
complete it because socket calls were failing. We found 3 separate bugs in that
path. I will file those JIRAs ASAP.
* On Mac - Domain Sockets Path cannot be more then 103 bytes and for some
reason getting /tmp + date time stamp will make the path longer , because /tmp
is a symbolic link. - I have a simple fix for this.
* On Mac - This failure happens in a specific test , where OS returns socket is
not connected error and we get stuck, we will eventually come out since the
test will time out. - This is the one that I was fixing, In fact I will create
a simple wrapper around shutdown0(so I can use mockito to inject errors) and
then call into the same call as you have proposed. But would like to break out
if we get an error like ENOTCONN - Socket not connected. At that point no one
can be using that socket anyway. From looking at the man pages -
http://man7.org/linux/man-pages/man2/shutdown.2.html - I get a feeling that if
a socket shutdown has failed, that socket will not be in use by anyone. They
would be getting errors while communicating is my feeling. Do you agree ? if
so, then breaking out seems to be the right thing to do.
* On Mac - There is another bug, where we ask the native layer to accept a
connection with timeout and that call also fails, I looked at the code and did
not see anything really wrong. I need to put it under a debugger. Hopefully
once we fix these three bugs, we can run native library on Mac
> if an IOException error is thrown in DomainSocket.close we go into infinite
> loop.
> ---------------------------------------------------------------------------------
>
> Key: HADOOP-11957
> URL: https://issues.apache.org/jira/browse/HADOOP-11957
> Project: Hadoop Common
> Issue Type: Bug
> Components: net
> Affects Versions: 2.7.1
> Reporter: Anu Engineer
> Assignee: Anu Engineer
> Attachments: HADOOP-11957.001.patch
>
>
> if an IOException error is thrown in DomainSocket.close we go into infinite
> loop.
> Issue : If the shutdown0(fd) call throws an IOException we break out of the
> if shutdown call but will continue to loop in the while loop infinitely since
> we have no way of decrementing the counter. Please scroll down and see the
> comment marked with Bug Bug to see where the issue is.
> {code:title=DomainSocket.java}
> @Override
> public void close() throws IOException {
> // Set the closed bit on this DomainSocket
> int count = 0;
> try {
> count = refCount.setClosed();
> } catch (ClosedChannelException e) {
> // Someone else already closed the DomainSocket.
> return;
> }
> // Wait for all references to go away
> boolean didShutdown = false;
> boolean interrupted = false;
> while (count > 0) {
> if (!didShutdown) {
> try {
> // Calling shutdown on the socket will interrupt blocking system
> // calls like accept, write, and read that are going on in a
> // different thread.
> shutdown0(fd);
> } catch (IOException e) {
> LOG.error("shutdown error: ", e);
> }
> didShutdown = true;
> // *BUG BUG* <-- Here the code will never exit the loop
> // if the count is greater then 0. we need to break out
> // of the while loop in case of IOException Error
> }
> try {
> Thread.sleep(10);
> } catch (InterruptedException e) {
> interrupted = true;
> }
> count = refCount.getReferenceCount();
> }
> // At this point, nobody has a reference to the file descriptor,
> // and nobody will be able to get one in the future either.
> // We now call close(2) on the file descriptor.
> // After this point, the file descriptor number will be reused by
> // something else. Although this DomainSocket object continues to hold
> // the old file descriptor number (it's a final field), we never use it
> // again because this DomainSocket is closed.
> close0(fd);
> if (interrupted) {
> Thread.currentThread().interrupt();
> }
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)