[ 
https://issues.apache.org/jira/browse/HDFS-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359919#comment-14359919
 ] 

Yongjun Zhang commented on HDFS-7915:
-------------------------------------

HI Colin,

Nice find of the new problem!

I did a review of rev04, sorry for a long post here:

1.  I think we should look harder in logging a reason when having to unregister 
a slot for better supportability (e.g., we want to find out the root cause). I 
agree that to make it 100% right would result in too complex logic though.  I 
would propose the following:

We don't have to capture runtime exception, instead, we can just care about 
IOException. Since bld already records reason for the different scenarios, we 
can add one more try/catch for the remaining:

We can add code to remember whatever exception recorded to bld as 
stage1Exception, and the other part stage2Exception.

{code}
    IOException stage1Exception = null;
    IOException stage2Exception = null;
        ...
    // add code to record the stage1Exception
    ...
    try {
        bld.build().writeDelimitedTo(socketOut);
        if (fis != null) {
          ......
          if (supportsReceiptVerification) {
            LOG.trace("Sending receipt verification byte for " + slotId);
            int val = sock.getInputStream().read();
            if (val < 0) {
              throw new EOFException("No verification byte received"); <== Add 
a message to this exception
            }
          } else {
            LOG.trace("Receipt verification is not enabled on the DataNode.  " +
                "Not verifying " + slotId);
          }
          success = true;
        }
    } catch (IOException e) {
       otherException = e;
       throw e;
    }
{code}
Notice I also added a message to the EOFException thrown there.

{code}
      if ((!success) && (registeredSlotId != null)) {
        String errMsg =
            (bld.getStatus() != SUCCESS)? bld.getMessage() :
              ((stage2Exception != null)?
                  stage2Exception.getMessage() : "unknown");
        LOG.info("Unregistering " + registeredSlotId + " because the " +
            "requestShortCircuitFdsForRead operation failed (" + errMsg + ")");
        datanode.shortCircuitRegistry.unregisterSlot(registeredSlotId);
        if (LOG.isDebugEnabled()) {
            if (stage1Exception != null) {
              LOG.debug("requestShortCircuitFds stage1Exception: " + 
StringUtils.stringifyException(stage1Exception));
        }
        if (stage2Exception != null) {
              LOG.debug("requestShortCircuitFds stage2Exception: " + 
StringUtils.stringifyException(stage2Exception));
        }
        }
      }
{code}

?

We can actually use a single exception variable for stage1 and stage2 because 
only one would be assigned at the time of logging.  I just want to throw some 
thoughts here.

2. question: change in BlockReaderFactory.java to move 
   "return new ShortCircuitReplicaInfo(replica);" to within the try block
is not important, I mean, it's ok not to move it, correct?

3. About the receipt verification:

{code}
        // writer
        if (buf[0] == SUPPORTS_RECEIPT_VERIFICATION.getNumber()) {
          LOG.trace("Sending receipt verification byte for slot " + slot);
          sock.getOutputStream().write((byte)0); 
        }
        =======================
        // reader
        int val = sock.getInputStream().read();
        if (val < 0) {
            throw new EOFException();
          }
{code}
* suggest to change {{sock.getOutputStream().write((byte)..}} to 
{{sock.getOutputStream().write((int)}}, since we are using 
{{DomainSocket#public void write(int val) throws IOException }} API.

* Should we define "0" as an constant somewhere and check equivalence instead 
of "val < 0" at the reader?

4. 
{code}
 LOG.trace("Sending receipt verification byte for " + slotId);
          int val = sock.getInputStream().read();
{code}
Looks to me that the message should be "Reading receipt byte for ...". right?

Thanks.


> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell 
> the DFSClient about it because of a network error
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7915
>                 URL: https://issues.apache.org/jira/browse/HDFS-7915
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-7915.001.patch, HDFS-7915.002.patch, 
> HDFS-7915.004.patch
>
>
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell 
> the DFSClient about it because of a network error.  In 
> {{DataXceiver#requestShortCircuitFds}}, the DataNode can succeed at the first 
> part (mark the slot as used) and fail at the second part (tell the DFSClient 
> what it did). The "try" block for unregistering the slot only covers a 
> failure in the first part, not the second part. In this way, a divergence can 
> form between the views of which slots are allocated on DFSClient and on 
> server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to