[
https://issues.apache.org/jira/browse/HDFS-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358092#comment-14358092
]
Yongjun Zhang commented on HDFS-7915:
-------------------------------------
Hi Colin,
Thanks for the new rev. I found that I made a mistake in my earlier test: I
need to pass -Pnative as a build switch to enable the test. After doing that,
I can see the test fail even with rev 001 once DataXceiver.java is reverted. Did
you suspect this problem when making rev 002?
Some additional comments:
{code}
fis = datanode.requestShortCircuitFdsForRead(blk, token, maxVersion);
bld.setStatus(SUCCESS);
bld.setShortCircuitAccessVersion(DataNode.CURRENT_BLOCK_FORMAT_VERSION);
{code}
Here {{bld}} is set to SUCCESS status without checking whether {{fis}} is null.
However, in the code further down:
{code}
if (fis != null) {
  FileDescriptor fds[] = new FileDescriptor[fis.length];
  ......
  success = true;
}
{code}
{{success}} is set to true only when {{fis}} is not null. This looks a bit
inconsistent. Is it a success when {{fis}} is null? If not, then the first
section has an issue. If yes, then we could rename {{success}} to
{{isFisObtained}}.
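To make the question concrete, here is a minimal standalone sketch of the first option, assuming a null {{fis}} is NOT a success. The class, the {{Status}} enum with its {{ERROR}} value, and the {{evaluate}} helper are all hypothetical stand-ins, not the actual DataXceiver code. It only shows both values being derived from the same null check so they cannot disagree.

```java
public class StatusConsistencySketch {
    /** Hypothetical stand-in for the response status; ERROR is an assumed failure value. */
    public enum Status { SUCCESS, ERROR }

    /** Holds the reported status and the local success flag side by side. */
    public static final class Outcome {
        public final Status status;
        public final boolean success;

        Outcome(Status status, boolean success) {
            this.status = status;
            this.success = success;
        }
    }

    /**
     * Derives both values from one null check, assuming null fis means failure.
     * The Object[] parameter stands in for the FileInputStream[] in the real code.
     */
    public static Outcome evaluate(Object[] fis) {
        boolean obtained = (fis != null);
        return new Outcome(obtained ? Status.SUCCESS : Status.ERROR, obtained);
    }

    public static void main(String[] args) {
        System.out.println(evaluate(null).status);
        System.out.println(evaluate(new Object[2]).status);
    }
}
```

If a null {{fis}} does count as success, this sketch does not apply and the rename to {{isFisObtained}} would be the simpler change.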
Regarding the logging below:
{code}
if ((!success) && (registeredSlotId != null)) {
  LOG.info("Unregistering " + registeredSlotId + " because the " +
      "requestShortCircuitFdsForRead operation failed.");
  datanode.shortCircuitRegistry.unregisterSlot(registeredSlotId);
}
{code}
The reason we have to unregister a slot here could be either an exception
recorded in {{bld}} or an exception that is not currently caught in this method.
I think we can add code to catch the currently uncaught exception, remember
it, and then re-throw it, so that when we do the logging above in the finally
block, we can report this exception as the reason why we are unregistering the
slot.
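A rough standalone sketch of the catch, remember, and re-throw idea follows. Everything in it is hypothetical scaffolding rather than the real DataXceiver code: {{SlotCleanupSketch}}, the {{Step}} interface, and the list-backed {{LOG}} are stand-ins, and the real method would call {{datanode.shortCircuitRegistry.unregisterSlot}} where the comment indicates.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SlotCleanupSketch {
    // Stand-in for the DataNode logger: collects messages so they can be inspected.
    public static final List<String> LOG = new ArrayList<>();

    /** Hypothetical stand-in for the work done after the slot is registered. */
    public interface Step {
        void run() throws IOException;
    }

    public static void requestWithCleanup(String registeredSlotId, Step step)
            throws IOException {
        boolean success = false;
        Exception failure = null; // remembered so the finally block can report it
        try {
            step.run();
            success = true;
        } catch (IOException | RuntimeException e) {
            failure = e; // remember the cause
            throw e;     // re-throw unchanged so callers see the same exception
        } finally {
            if (!success && registeredSlotId != null) {
                // Errors (e.g. OutOfMemoryError) are not caught above, so
                // failure may still be null here.
                LOG.add("Unregistering " + registeredSlotId + " because the "
                    + "requestShortCircuitFdsForRead operation failed: " + failure);
                // The real code would unregister the slot at this point.
            }
        }
    }

    public static void main(String[] args) {
        try {
            requestWithCleanup("slot-1", () -> {
                throw new IOException("simulated network error");
            });
        } catch (IOException e) {
            System.out.println("caller still sees: " + e.getMessage());
        }
        LOG.forEach(System.out::println);
    }
}
```

The multi-catch keeps the method signature unchanged, since only IOException and unchecked exceptions are re-thrown.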
What do you think?
Thanks.
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell
> the DFSClient about it because of a network error
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-7915
> URL: https://issues.apache.org/jira/browse/HDFS-7915
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-7915.001.patch, HDFS-7915.002.patch
>
>
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell
> the DFSClient about it because of a network error. In
> {{DataXceiver#requestShortCircuitFds}}, the DataNode can succeed at the first
> part (mark the slot as used) and fail at the second part (tell the DFSClient
> what it did). The "try" block for unregistering the slot only covers a
> failure in the first part, not the second part. In this way, a divergence can
> form between the views of which slots are allocated on DFSClient and on
> server.