[ 
https://issues.apache.org/jira/browse/HDFS-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358092#comment-14358092
 ] 

Yongjun Zhang commented on HDFS-7915:
-------------------------------------

Hi Colin,

Thanks for the new rev. I found that I made a mistake in my earlier test: I 
need to pass -Pnative as a compile switch to enable the test. After doing that, 
I can see the test fail even with rev 001 after reverting DataXceiver.java. Did 
you anticipate this problem when making rev 002?

Some additional comments:

{code}
      fis = datanode.requestShortCircuitFdsForRead(blk, token, maxVersion);
      bld.setStatus(SUCCESS);
      bld.setShortCircuitAccessVersion(DataNode.CURRENT_BLOCK_FORMAT_VERSION);
{code}
Here {{bld}} is set to SUCCESS status without checking whether {{fis}} is 
null. However, further down in the code:
{code}
 if (fis != null) {
        FileDescriptor fds[] = new FileDescriptor[fis.length];
        ......
        success = true;
 }
{code}
{{success}} is set to true only when {{fis}} is not null. I see a slight 
inconsistency here. Is it a success when {{fis}} is null? If not, then the 
first section has an issue. If yes, then we can probably rename {{success}} to 
{{isFisObtained}}.

It seems when we do the logging below
{code}
   if ((!success) && (registeredSlotId != null)) {
        LOG.info("Unregistering " + registeredSlotId + " because the " +
            "requestShortCircuitFdsForRead operation failed.");
        datanode.shortCircuitRegistry.unregisterSlot(registeredSlotId);
      }
{code}
The reason that we have to unregister a slot could be an exception recorded in 
{{bld}}, or an exception that is not currently caught in this method.

I think we can add code to catch the currently uncaught exception, remember 
it, and then re-throw it, so that when we do the logging above in the finally 
block, we can report this exception as the reason why we are unregistering the 
slot.
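
A minimal sketch of the catch, remember, re-throw idea follows. It is not 
taken from the patch: {{SlotCleanupSketch}}, {{Step}}, {{handleRequest}} and 
the in-memory {{LOG}} list are hypothetical stand-ins, and the real method 
would call {{datanode.shortCircuitRegistry.unregisterSlot}} where the comment 
indicates.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the request handler; these names are not from the patch.
class SlotCleanupSketch {
    // Collects log lines in memory so the sketch needs no logging framework.
    static final List<String> LOG = new ArrayList<>();

    // Hypothetical operation that may fail after the slot has been registered.
    interface Step {
        void run() throws IOException;
    }

    static void handleRequest(String registeredSlotId, Step step) throws IOException {
        boolean success = false;
        Throwable failure = null; // remembers why the operation failed, if it did
        try {
            step.run();
            success = true;
        } catch (IOException | RuntimeException e) {
            failure = e; // remember the exception for the log message below
            throw e;     // re-throw it unchanged so callers see the same failure
        } finally {
            if ((!success) && (registeredSlotId != null)) {
                LOG.add("Unregistering " + registeredSlotId + " because the "
                    + "requestShortCircuitFdsForRead operation failed: " + failure);
                // The real code would unregister the slot here.
            }
        }
    }

    public static void main(String[] args) {
        try {
            handleRequest("slot-1", () -> { throw new IOException("broken pipe"); });
        } catch (IOException e) {
            // Expected: the original exception still reaches the caller.
        }
        LOG.forEach(System.out::println);
    }
}
```

With this shape the exception still propagates to the caller as before, and 
the only change is that the log line carries the cause.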

What do you think?

Thanks.






> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell 
> the DFSClient about it because of a network error
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7915
>                 URL: https://issues.apache.org/jira/browse/HDFS-7915
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-7915.001.patch, HDFS-7915.002.patch
>
>
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell 
> the DFSClient about it because of a network error.  In 
> {{DataXceiver#requestShortCircuitFds}}, the DataNode can succeed at the first 
> part (mark the slot as used) and fail at the second part (tell the DFSClient 
> what it did). The "try" block for unregistering the slot only covers a 
> failure in the first part, not the second part. In this way, a divergence can 
> form between the views of which slots are allocated on DFSClient and on 
> server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
