[
https://issues.apache.org/jira/browse/HDFS-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361208#comment-14361208
]
Colin Patrick McCabe commented on HDFS-7915:
--------------------------------------------
bq. 1. I think we should look harder in logging a reason when having to
unregister a slot for better supportability (e.g., we want to find out the root
cause). I agree that to make it 100% right would result in too complex logic
though. I would propose the following:
I understand your concerns, but every log I've looked at does display the
reason why the fd passing failed, including the full exception. It is simply
logged in a catch block further up in the DataXceiver. Logging it again in
this function would just be repetitious. Sorry if that was unclear.
bq. 2. question: change in BlockReaderFactory.java to move "return new
ShortCircuitReplicaInfo(replica);" to within the try block is not important, I
mean, it's ok not to move it, correct?
Yes, it is OK not to move it, because the ShortCircuitReplicaInfo constructor
currently can't fail (it never throws). But it is better to have it within the
try block in case a throw is later added to the constructor. It is safer.
bq. suggest to change sock.getOutputStream().write((byte).. to
sock.getOutputStream().write((int), since we are using {{DomainSocket#public
void write(int val) throws IOException }} API.
OK
bq. Should we define "0" as a constant somewhere and check equivalence instead
of "val < 0" at the reader?
It's not necessary. We don't care what the value is. Adding checks is
actually bad because it means we can't decide to use it later for some other
purpose.
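
To illustrate the point about not checking the value, here is a rough sketch of
the receipt-byte idea. The class and method names are made up for this example,
and plain java.io streams stand in for the DomainSocket streams; it is not the
code in the patch. The reader only treats a negative return (end of stream) as
a failure and accepts any byte value.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch, not the HDFS-7915 patch itself.
class ReceiptByteSketch {
    /** Sender side: write a single receipt byte. The value itself is arbitrary. */
    static void sendReceipt(OutputStream out) throws IOException {
        out.write(0);
        out.flush();
    }

    /**
     * Reader side: wait for the receipt byte. InputStream#read returns -1 at
     * end of stream, so "val < 0" means the peer went away before confirming.
     * Any non-negative value is accepted, which leaves the value free for
     * some other use later.
     */
    static void awaitReceipt(InputStream in) throws IOException {
        int val = in.read();
        if (val < 0) {
            throw new EOFException("peer closed before sending receipt byte");
        }
    }
}
```

If the reader compared against a fixed constant instead, any future change to
the byte's meaning would break older readers.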
bq. Looks to me that the message should be "Reading receipt byte for ...".
right?
Thanks, fixed.
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell
> the DFSClient about it because of a network error
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-7915
> URL: https://issues.apache.org/jira/browse/HDFS-7915
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-7915.001.patch, HDFS-7915.002.patch,
> HDFS-7915.004.patch
>
>
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell
> the DFSClient about it because of a network error. In
> {{DataXceiver#requestShortCircuitFds}}, the DataNode can succeed at the first
> part (mark the slot as used) and fail at the second part (tell the DFSClient
> what it did). The "try" block for unregistering the slot only covers a
> failure in the first part, not the second part. In this way, a divergence can
> form between the views of which slots are allocated on DFSClient and on
> server.
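
The failure mode in the description above can be sketched roughly as follows.
This is only an illustration under assumed names (SlotRegistry, ResponseSender
and requestFds are hypothetical stand-ins, not the DataXceiver code): the
cleanup has to cover both the slot registration and the reply to the client,
so that a network error in the second part also unregisters the slot.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the cleanup scope, not the actual DataNode code.
class SlotCleanupSketch {
    /** Stand-in for the server-side table of allocated slots. */
    static final class SlotRegistry {
        private final Set<Integer> used = new HashSet<>();
        void register(int slotId) { used.add(slotId); }
        void unregister(int slotId) { used.remove(slotId); }
        boolean isRegistered(int slotId) { return used.contains(slotId); }
    }

    /** Stand-in for "tell the client what we did"; may fail on a network error. */
    interface ResponseSender {
        void send(int slotId) throws IOException;
    }

    static void requestFds(SlotRegistry registry, int slotId, ResponseSender sender)
            throws IOException {
        boolean success = false;
        try {
            registry.register(slotId); // first part: mark the slot as used
            sender.send(slotId);       // second part: tell the client
            success = true;
        } finally {
            // Runs for a failure in either part, so the server's view of
            // allocated slots cannot drift from the client's view.
            if (!success) {
                registry.unregister(slotId);
            }
        }
    }
}
```

The exception still propagates to the caller, so it can be logged once further
up the call stack.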