[ 
https://issues.apache.org/jira/browse/HDFS-12979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833954#comment-16833954
 ] 

Erik Krogen commented on HDFS-12979:
------------------------------------

Hey [~vagarychen], though I agree this fixes things on the Observer side, I 
think we need to update the logic within {{StandbyCheckpointer}} as well. For 
starters, we have a field called {{activeNNAddresses}}, but it will really 
contain the active, observer, and other standby nodes (it seems it should 
have been renamed when HDFS-6440 was completed). More importantly, today once a 
standby NN succeeds in uploading to a single NN, it will stop:
{code:java,name=StandbyCheckpointer}
    for (; i < uploads.size(); i++) {
      Future<TransferFsImage.TransferResult> upload = uploads.get(i);
      try {
        // TODO should there be some smarts here about retries nodes that are not the active NN?
        if (upload.get() == TransferFsImage.TransferResult.SUCCESS) {
          success = true;
          //avoid getting the rest of the results - we don't care since we had a successful upload
          break;
        }

      } catch (ExecutionException e) {
        ioe = new IOException("Exception during image upload", e);
        break;
      } catch (InterruptedException e) {
        ie = e;
        break;
      }
    }
{code}
We need to modify this to continue to monitor the success of all uploads, since 
a single Standby NN may need to upload to multiple locations.
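
As a rough illustration of that change, here is a minimal, self-contained sketch, not a patch against {{StandbyCheckpointer}}. The class {{UploadMonitor}}, the method {{awaitAll}}, the {{Summary}} type, and the two-value {{TransferResult}} enum are all hypothetical stand-ins invented for this example; the real {{TransferFsImage.TransferResult}} is not reproduced here. The idea is simply to stop breaking out of the loop on the first success or the first {{ExecutionException}}, and to record the outcome of every upload instead.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

/** Hypothetical sketch only; names here are not from the Hadoop code base. */
class UploadMonitor {

  /** Simplified stand-in for TransferFsImage.TransferResult (assumption). */
  enum TransferResult { SUCCESS, FAILURE }

  /** Outcome of waiting on every upload rather than only the first success. */
  static final class Summary {
    final int successes;
    final List<IOException> failures;

    Summary(int successes, List<IOException> failures) {
      this.successes = successes;
      this.failures = failures;
    }

    /** True only if every one of the expected uploads reported SUCCESS. */
    boolean allSucceeded(int expected) {
      return successes == expected;
    }
  }

  /**
   * Waits for every upload to finish. A success or a failed upload does not
   * end the loop; only an interrupt does, in which case the uploads that
   * have not been collected yet are cancelled and the interrupt is rethrown.
   */
  static Summary awaitAll(List<Future<TransferResult>> uploads)
      throws InterruptedException {
    int successes = 0;
    List<IOException> failures = new ArrayList<>();
    for (int i = 0; i < uploads.size(); i++) {
      try {
        if (uploads.get(i).get() == TransferResult.SUCCESS) {
          successes++;
        } else {
          failures.add(new IOException("Upload " + i + " did not succeed"));
        }
      } catch (ExecutionException e) {
        // Remember the failure, but keep checking the remaining uploads.
        failures.add(new IOException("Exception during image upload " + i, e));
      } catch (InterruptedException e) {
        // Cancel this upload and everything after it, then propagate.
        for (int j = i; j < uploads.size(); j++) {
          uploads.get(j).cancel(true);
        }
        throw e;
      }
    }
    return new Summary(successes, failures);
  }
}
```

With something along these lines, the caller could decide what counts as a successful checkpoint upload (for example, requiring the active NN to be among the successes) instead of treating the first {{SUCCESS}} as sufficient. Whether a failed upload to an observer should fail the whole checkpoint is a policy question this sketch leaves open.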

> StandbyNode should upload FsImage to ObserverNode after checkpointing.
> ----------------------------------------------------------------------
>
>                 Key: HDFS-12979
>                 URL: https://issues.apache.org/jira/browse/HDFS-12979
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs
>            Reporter: Konstantin Shvachko
>            Assignee: Chen Liang
>            Priority: Major
>         Attachments: HDFS-12979.001.patch
>
>
> ObserverNode does not create checkpoints. So its fsimage file can get very 
> old, making bootstrap of ObserverNode too long. A StandbyNode should copy 
> the latest fsimage to ObserverNode(s) along with the ANN.



