[
https://issues.apache.org/jira/browse/HDFS-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mingliang Liu updated HDFS-9129:
--------------------------------
Attachment: HDFS-9129.010.patch
Thank you [~wheat9] for your review. The v10 patch addresses the comments,
along with rebasing from {{trunk}} branch. As there are conflicts because of
[HDFS-4015], I also invited [~arpitagarwal] and [~anu] to review the patch.
See response inline to [~wheat9]'s comments.
{quote}
1. It does not give much information compared to figuring out the issues on the
code directly. What does "thresholds met" / "extensions reached" mean? It
causes more confusions than explanations.
{quote}
The motivation is that diagram is helpful for glimpse, if we can provide
definition of "thresholds met" / "extension reached". In v10 patch, I add more
explanations in the comments.
{quote}
{code}
LOG.error("Non-recognized block manager safe mode status: {}", status);
{code}
2. Should be an assert.
{quote}
Truely, I'll simply {{assert false : "some comment"}}.
{quote}
{code}
private volatile boolean isInManualSafeMode = false;
private volatile boolean isInResourceLowSafeMode = false;
...
isInManualSafeMode = !resourcesLow;
isInResourceLowSafeMode = resourcesLow;
{code}
3. How do these two variables synchronize? Is the system in consistent state in
the middle of the execution?
{quote}
Good question. Actually it's not in consistent state in the middle of the
execution. If the {{resourceLow}} is true, and before that name node is in
manual safe mode, in the middle of the execution {{isInSafeMode}} will return
false, which means the safe mode is OFF. The main reason is that writing to the
two variables (aka enter/leave safemode) is guarded by the FSNamesystem write
lock, while read is not.
The enum type state was replace with two boolean flags in the v7 patch, because
the two-lay state machine was cumbersome per offline discussion with [~wheat9]
and [~jingzhao]. Guarding all the read looks expensive. Bitwise operation on a
flag variable seems tricky.
The new design goes back to the {{trunk}} logic, which makes the block manager
stays in safe mode in the middle of the execution:
{code}
if (resourcesLow) {
isInResourceLowSafeMode = true;
} else {
isInManualSafeMode = true;
}
{code}
In case both {{isInManualSafeMode}} and {{isInResourceLowSafeMode}} are true,
{{isInResourceLowSafeMode}} will short-circuit {{isInManualSafeMode}},
according to our current logic, e.g. {{getTurnOffTip()}}.
{quote}
{code}
+ // INITIALIZED -> THRESHOLD
+ bmSafeMode.setBlockTotal(BLOCK_TOTAL);
+ assertEquals(BMSafeModeStatus.THRESHOLD, getSafeModeStatus());
+ assertTrue(bmSafeMode.isInSafeMode());
{code}
4. It makes sense to put it in a test instead of in the @Before method.
{quote}
That makes sense to me. I'll add a new test called {{testSetBlockTotal}}.
{quote}
{code}
+ // EXTENSION -> OFF
+ Whitebox.setInternalState(bmSafeMode, "status",
BMSafeModeStatus.EXTENSION);
+ reachBlockThreshold();
+ reachExtension();
+ bmSafeMode.checkSafeMode();
+ assertEquals(BMSafeModeStatus.OFF, getSafeModeStatus());
{code}
5. Please refactor the code – you can reuse the getSafeModeStatus() that is
defined by the class.
{quote}
Actually the first statment for each state transition tests is to *set* the
internal state.
I'll make the fields package accessible so the Whitebox is not needed. See the
end of this comment.
{quote}
{code}
assertEquals(getBlockSafe(), i);
{code}
6. The expected value should go first according to the function signature.
{quote}
Nice catch. I will revise the whole test to fix all similar ones.
{quote}
7. A higher level question is that why it needs so many getInternalState()
statements? It looks to me that many of these behaviors can be observed outside
without whitebox testing.
{quote}
The reason was that *all* non-static fields in {{BlockManagerSafeMode}} is
private. By design, I made them _private_ because we don't need access to its
internal state in {{BlockManager}}.
In the new design, the v10 patch simply makes the fields package private so the
test will be more straight-forward.
> Move the safemode block count into BlockManager
> -----------------------------------------------
>
> Key: HDFS-9129
> URL: https://issues.apache.org/jira/browse/HDFS-9129
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Haohui Mai
> Assignee: Mingliang Liu
> Attachments: HDFS-9129.000.patch, HDFS-9129.001.patch,
> HDFS-9129.002.patch, HDFS-9129.003.patch, HDFS-9129.004.patch,
> HDFS-9129.005.patch, HDFS-9129.006.patch, HDFS-9129.007.patch,
> HDFS-9129.008.patch, HDFS-9129.009.patch, HDFS-9129.010.patch
>
>
> The {{SafeMode}} needs to track whether there are enough blocks so that the
> NN can get out of the safemode. These fields can moved to the
> {{BlockManager}} class.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)