[ 
https://issues.apache.org/jira/browse/HDFS-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingliang Liu updated HDFS-9129:
--------------------------------
    Attachment: HDFS-9129.010.patch

Thank you [~wheat9] for your review. The v10 patch addresses the comments, 
along with rebasing from {{trunk}} branch. As there are conflicts because of 
[HDFS-4015], I also invited [~arpitagarwal] and [~anu] to review the patch.

See response inline to [~wheat9]'s comments.
{quote}
1. It does not give much information compared to figuring out the issues on the 
code directly. What does "thresholds met" / "extensions reached" mean? It 
causes more confusions than explanations.
{quote}
The motivation is that diagram is helpful for glimpse, if we can provide 
definition of "thresholds met" / "extension reached". In v10 patch, I add more 
explanations in the comments.

{quote}
{code}
LOG.error("Non-recognized block manager safe mode status: {}", status);
{code}
2. Should be an assert.
{quote}
Truely, I'll simply {{assert false : "some comment"}}.

{quote}
{code}
private volatile boolean isInManualSafeMode = false;
private volatile boolean isInResourceLowSafeMode = false;

...
isInManualSafeMode = !resourcesLow;
isInResourceLowSafeMode = resourcesLow;
{code}
3. How do these two variables synchronize? Is the system in consistent state in 
the middle of the execution?
{quote}
Good question. Actually it's not in consistent state in the middle of the 
execution. If the {{resourceLow}} is true, and before that name node is in 
manual safe mode, in the middle of the execution {{isInSafeMode}} will return 
false, which means the safe mode is OFF. The main reason is that writing to the 
two variables (aka enter/leave safemode) is guarded by the FSNamesystem write 
lock, while read is not.

The enum type state was replace with two boolean flags in the v7 patch, because 
the two-lay state machine was cumbersome per offline discussion with [~wheat9] 
and [~jingzhao]. Guarding all the read looks expensive. Bitwise operation on a 
flag variable seems tricky.

The new design goes back to the {{trunk}} logic, which makes the block manager 
stays in safe mode in the middle of the execution:
{code}
      if (resourcesLow) {
        isInResourceLowSafeMode = true;
      } else {
        isInManualSafeMode = true;
      }
{code}
In case both {{isInManualSafeMode}} and {{isInResourceLowSafeMode}} are true, 
{{isInResourceLowSafeMode}} will short-circuit {{isInManualSafeMode}}, 
according to our current logic, e.g. {{getTurnOffTip()}}.

{quote}
{code}
+    // INITIALIZED -> THRESHOLD
+    bmSafeMode.setBlockTotal(BLOCK_TOTAL);
+    assertEquals(BMSafeModeStatus.THRESHOLD, getSafeModeStatus());
+    assertTrue(bmSafeMode.isInSafeMode());
{code}
4. It makes sense to put it in a test instead of in the @Before method.
{quote}
That makes sense to me. I'll add a new test called {{testSetBlockTotal}}.

{quote}
{code}
+    // EXTENSION -> OFF
+    Whitebox.setInternalState(bmSafeMode, "status", 
BMSafeModeStatus.EXTENSION);
+    reachBlockThreshold();
+    reachExtension();
+    bmSafeMode.checkSafeMode();
+    assertEquals(BMSafeModeStatus.OFF, getSafeModeStatus());
{code}
5. Please refactor the code – you can reuse the getSafeModeStatus() that is 
defined by the class.
{quote}
Actually the first statment for each state transition tests is to *set* the 
internal state.
I'll make the fields package accessible so the Whitebox is not needed. See the 
end of this comment.

{quote}
{code}
assertEquals(getBlockSafe(), i);
{code}
6. The expected value should go first according to the function signature.
{quote}
Nice catch. I will revise the whole test to fix all similar ones.

{quote}
7. A higher level question is that why it needs so many getInternalState() 
statements? It looks to me that many of these behaviors can be observed outside 
without whitebox testing.
{quote}
The reason was that *all* non-static fields in {{BlockManagerSafeMode}} is 
private. By design, I made them _private_ because we don't need access to its 
internal state in {{BlockManager}}.

In the new design, the v10 patch simply makes the fields package private so the 
test will be more straight-forward.

> Move the safemode block count into BlockManager
> -----------------------------------------------
>
>                 Key: HDFS-9129
>                 URL: https://issues.apache.org/jira/browse/HDFS-9129
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Haohui Mai
>            Assignee: Mingliang Liu
>         Attachments: HDFS-9129.000.patch, HDFS-9129.001.patch, 
> HDFS-9129.002.patch, HDFS-9129.003.patch, HDFS-9129.004.patch, 
> HDFS-9129.005.patch, HDFS-9129.006.patch, HDFS-9129.007.patch, 
> HDFS-9129.008.patch, HDFS-9129.009.patch, HDFS-9129.010.patch
>
>
> The {{SafeMode}} needs to track whether there are enough blocks so that the 
> NN can get out of the safemode. These fields can moved to the 
> {{BlockManager}} class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to