[jira] [Updated] (HDFS-5840) Follow-up to HDFS-5138 to improve error handling during partial upgrade failures

Aaron T. Myers (JIRA) Thu, 13 Feb 2014 01:34:25 -0800

     [ 
https://issues.apache.org/jira/browse/HDFS-5840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Aaron T. Myers updated HDFS-5840:
---------------------------------

    Attachment: HDFS-5840.patch

Attaching a patch which should address all of the individual comments:

bq. assertAllResultsEqual - for loop can just start with i = 1? Also if the 
collection objects is of size zero or one, the method can return early. Is 
there a need to do object.toArray() for these early checks? With that, perhaps 
the findbugs exclude may not be necessary.

Good thinking re: starting the loop at 1 and exiting early. Even with that, I 
don't think the findbugs exclude can be removed. The issue is this: 
"{{(currElement == null && currElement != lastElement)}}". In that code we know 
that currElement is equal to null, and then we compare the known-null value to 
something else. This is correct, and even though I could have explicitly 
compared lastElement to null, this seemed clearer to me. Let me know if you 
disagree.

bq. Unit test can be added for methods isAtLeastOneActive, 
getRpcAddressesForNameserviceId and getProxiesForAllNameNodesInNameservice (I 
am okay if this is done in a separate jira)

Done. I added a single unit test which wouldn't work if any of the three of 
these were broken, since isAtLeastOneActive by definition uses the other two.

bq. Finalizing upgrade is quite tricky. Consider the following scenarios:

See below for comments on this.

bq. Why is throw new AssertionError("Unreachable code."); in 
QuorumJournalManager.java methods?

These methods have to have to either return something or unconditionally throw 
an exception. In fact this method will always either return a value (when all 
JNs are up and responsive) or else will throw an exception. However, the 
compiler can't discern this, so I put in these exceptions to placate it. It was 
Todd's suggestion to throw AssertionError specifically. See his comment on 
HDFS-5138 on 2014/01/07.

bq. FSImage#doRollBack() - when canRollBack is false after checking if 
non-share directories can rollback, an exception must be immediately thrown, 
instead of checking shared editlog. 

I don't think I agree with this. Imagine a scenario where an earlier rollback 
attempt had succeeded for all the local storage dirs, but not the shared edit 
log. Shouldn't we continue the rollback in this case?

bq. Also printing Log.info when storages can be rolled back will help in 
debugging.

Good thinking. Done.

bq. FSEditlog#canRollBackSharedLog should accept StorageInfo instead of Storage

Good catch. Done.

bq. QuorumJournalManager#canRollBack and getJournalCTime can throw 
AssertionError (from DFSUtil.assertAllResultsEqual()). Is that the right 
exception to expose or IOException?

I don't think it actually makes a functional difference, since in either case 
we bail out of the process, but I've changed it nonetheless since we tend to 
favor IOExceptions throughout the codebase.

bq. Namenode startup throws AssertionError with -rollback option. I think we 
should throw IOException, which is how all the other failures are indicated.

Note that this code shouldn't be reachable anymore from actually running the 
command since NameNode#doRollback doesn't actually start the NN daemon anymore 
- just performs the rollback and shuts down. I only left this in there to make 
sure that in the future we don't inadvertently reintroduce the old behavior 
somehow. Given this, I think we should leave it as an AssertionError. Do you 
agree?

I've also updated the documentation as you suggested.

As for handling the partial upgrade failure as you've described, I'd like to 
add one more RPC call to the JournalManager to initiate analysis/recovery of 
the storage dirs upon first contact, and then refactor the contents of 
FSImage#recoverStorageDirs into NNUpgradeUtil just like was done with the other 
upgrade-related procedures. If this sounds OK to you, I'll go ahead and add 
that stuff and appropriate tests.

> Follow-up to HDFS-5138 to improve error handling during partial upgrade 
> failures
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-5840
>                 URL: https://issues.apache.org/jira/browse/HDFS-5840
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.0.0
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>             Fix For: 3.0.0
>
>         Attachments: HDFS-5840.patch
>
>
> Suresh posted some good comment in HDFS-5138 after that patch had already 
> been committed to trunk. This JIRA is to address those. See the first comment 
> of this JIRA for the full content of the review.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (HDFS-5840) Follow-up to HDFS-5138 to improve error handling during partial upgrade failures

Reply via email to