[
https://issues.apache.org/jira/browse/HDFS-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359919#comment-14359919
]
Yongjun Zhang commented on HDFS-7915:
-------------------------------------
HI Colin,
Nice find of the new problem!
I did a review of rev04, sorry for a long post here:
1. I think we should look harder in logging a reason when having to unregister
a slot for better supportability (e.g., we want to find out the root cause). I
agree that to make it 100% right would result in too complex logic though. I
would propose the following:
We don't have to capture runtime exception, instead, we can just care about
IOException. Since bld already records reason for the different scenarios, we
can add one more try/catch for the remaining:
We can add code to remember whatever exception recorded to bld as
stage1Exception, and the other part stage2Exception.
{code}
IOException stage1Exception = null;
IOException stage2Exception = null;
...
// add code to record the stage1Exception
...
try {
bld.build().writeDelimitedTo(socketOut);
if (fis != null) {
......
if (supportsReceiptVerification) {
LOG.trace("Sending receipt verification byte for " + slotId);
int val = sock.getInputStream().read();
if (val < 0) {
throw new EOFException("No verification byte received"); <== Add
a message to this exception
}
} else {
LOG.trace("Receipt verification is not enabled on the DataNode. " +
"Not verifying " + slotId);
}
success = true;
}
} catch (IOException e) {
otherException = e;
throw e;
}
{code}
Notice I also added a message to the EOFException thrown there.
{code}
if ((!success) && (registeredSlotId != null)) {
String errMsg =
(bld.getStatus() != SUCCESS)? bld.getMessage() :
((stage2Exception != null)?
stage2Exception.getMessage() : "unknown");
LOG.info("Unregistering " + registeredSlotId + " because the " +
"requestShortCircuitFdsForRead operation failed (" + errMsg + ")");
datanode.shortCircuitRegistry.unregisterSlot(registeredSlotId);
if (LOG.isDebugEnabled()) {
if (stage1Exception != null) {
LOG.debug("requestShortCircuitFds stage1Exception: " +
StringUtils.stringifyException(stage1Exception));
}
if (stage2Exception != null) {
LOG.debug("requestShortCircuitFds stage2Exception: " +
StringUtils.stringifyException(stage2Exception));
}
}
}
{code}
?
We can actually use a single exception variable for stage1 and stage2 because
only one would be assigned at the time of logging. I just want to throw some
thoughts here.
2. question: change in BlockReaderFactory.java to move
"return new ShortCircuitReplicaInfo(replica);" to within the try block
is not important, I mean, it's ok not to move it, correct?
3. About the receipt verification:
{code}
// writer
if (buf[0] == SUPPORTS_RECEIPT_VERIFICATION.getNumber()) {
LOG.trace("Sending receipt verification byte for slot " + slot);
sock.getOutputStream().write((byte)0);
}
=======================
// reader
int val = sock.getInputStream().read();
if (val < 0) {
throw new EOFException();
}
{code}
* suggest to change {{sock.getOutputStream().write((byte)..}} to
{{sock.getOutputStream().write((int)}}, since we are using
{{DomainSocket#public void write(int val) throws IOException }} API.
* Should we define "0" as an constant somewhere and check equivalence instead
of "val < 0" at the reader?
4.
{code}
LOG.trace("Sending receipt verification byte for " + slotId);
int val = sock.getInputStream().read();
{code}
Looks to me that the message should be "Reading receipt byte for ...". right?
Thanks.
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell
> the DFSClient about it because of a network error
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-7915
> URL: https://issues.apache.org/jira/browse/HDFS-7915
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-7915.001.patch, HDFS-7915.002.patch,
> HDFS-7915.004.patch
>
>
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell
> the DFSClient about it because of a network error. In
> {{DataXceiver#requestShortCircuitFds}}, the DataNode can succeed at the first
> part (mark the slot as used) and fail at the second part (tell the DFSClient
> what it did). The "try" block for unregistering the slot only covers a
> failure in the first part, not the second part. In this way, a divergence can
> form between the views of which slots are allocated on DFSClient and on
> server.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)