[ https://issues.apache.org/jira/browse/HDFS-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mingliang Liu updated HDFS-9129: -------------------------------- Attachment: HDFS-9129.010.patch Thank you [~wheat9] for your review. The v10 patch addresses the comments, along with rebasing from {{trunk}} branch. As there are conflicts because of [HDFS-4015], I also invited [~arpitagarwal] and [~anu] to review the patch. See response inline to [~wheat9]'s comments. {quote} 1. It does not give much information compared to figuring out the issues on the code directly. What does "thresholds met" / "extensions reached" mean? It causes more confusions than explanations. {quote} The motivation is that diagram is helpful for glimpse, if we can provide definition of "thresholds met" / "extension reached". In v10 patch, I add more explanations in the comments. {quote} {code} LOG.error("Non-recognized block manager safe mode status: {}", status); {code} 2. Should be an assert. {quote} Truely, I'll simply {{assert false : "some comment"}}. {quote} {code} private volatile boolean isInManualSafeMode = false; private volatile boolean isInResourceLowSafeMode = false; ... isInManualSafeMode = !resourcesLow; isInResourceLowSafeMode = resourcesLow; {code} 3. How do these two variables synchronize? Is the system in consistent state in the middle of the execution? {quote} Good question. Actually it's not in consistent state in the middle of the execution. If the {{resourceLow}} is true, and before that name node is in manual safe mode, in the middle of the execution {{isInSafeMode}} will return false, which means the safe mode is OFF. The main reason is that writing to the two variables (aka enter/leave safemode) is guarded by the FSNamesystem write lock, while read is not. The enum type state was replace with two boolean flags in the v7 patch, because the two-lay state machine was cumbersome per offline discussion with [~wheat9] and [~jingzhao]. Guarding all the read looks expensive. Bitwise operation on a flag variable seems tricky. The new design goes back to the {{trunk}} logic, which makes the block manager stays in safe mode in the middle of the execution: {code} if (resourcesLow) { isInResourceLowSafeMode = true; } else { isInManualSafeMode = true; } {code} In case both {{isInManualSafeMode}} and {{isInResourceLowSafeMode}} are true, {{isInResourceLowSafeMode}} will short-circuit {{isInManualSafeMode}}, according to our current logic, e.g. {{getTurnOffTip()}}. {quote} {code} + // INITIALIZED -> THRESHOLD + bmSafeMode.setBlockTotal(BLOCK_TOTAL); + assertEquals(BMSafeModeStatus.THRESHOLD, getSafeModeStatus()); + assertTrue(bmSafeMode.isInSafeMode()); {code} 4. It makes sense to put it in a test instead of in the @Before method. {quote} That makes sense to me. I'll add a new test called {{testSetBlockTotal}}. {quote} {code} + // EXTENSION -> OFF + Whitebox.setInternalState(bmSafeMode, "status", BMSafeModeStatus.EXTENSION); + reachBlockThreshold(); + reachExtension(); + bmSafeMode.checkSafeMode(); + assertEquals(BMSafeModeStatus.OFF, getSafeModeStatus()); {code} 5. Please refactor the code – you can reuse the getSafeModeStatus() that is defined by the class. {quote} Actually the first statment for each state transition tests is to *set* the internal state. I'll make the fields package accessible so the Whitebox is not needed. See the end of this comment. {quote} {code} assertEquals(getBlockSafe(), i); {code} 6. The expected value should go first according to the function signature. {quote} Nice catch. I will revise the whole test to fix all similar ones. {quote} 7. A higher level question is that why it needs so many getInternalState() statements? It looks to me that many of these behaviors can be observed outside without whitebox testing. {quote} The reason was that *all* non-static fields in {{BlockManagerSafeMode}} is private. By design, I made them _private_ because we don't need access to its internal state in {{BlockManager}}. In the new design, the v10 patch simply makes the fields package private so the test will be more straight-forward. > Move the safemode block count into BlockManager > ----------------------------------------------- > > Key: HDFS-9129 > URL: https://issues.apache.org/jira/browse/HDFS-9129 > Project: Hadoop HDFS > Issue Type: Sub-task > Reporter: Haohui Mai > Assignee: Mingliang Liu > Attachments: HDFS-9129.000.patch, HDFS-9129.001.patch, > HDFS-9129.002.patch, HDFS-9129.003.patch, HDFS-9129.004.patch, > HDFS-9129.005.patch, HDFS-9129.006.patch, HDFS-9129.007.patch, > HDFS-9129.008.patch, HDFS-9129.009.patch, HDFS-9129.010.patch > > > The {{SafeMode}} needs to track whether there are enough blocks so that the > NN can get out of the safemode. These fields can moved to the > {{BlockManager}} class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)