[
https://issues.apache.org/jira/browse/HDFS-12618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16306556#comment-16306556
]
Xiao Chen edited comment on HDFS-12618 at 12/29/17 9:39 PM:
------------------------------------------------------------
Sorry for my month-long delay in reviewing; I finally locked myself to the chair
and reviewed the latest patch and comments before the end of the year.
Good to see we're improving, and happy to see the many added test cases. Thanks
for the continued work [~wchevreuil], and [~daryn] for the reviews.
The problem is pretty hard, but the direction looks good. Some comments on patch 4:
- Calling {{validate()}} and then catching {{AssertionError}} should be changed,
for the reasons Daryn mentioned, plus the fact that assertions can be disabled
at run time. See
https://docs.oracle.com/javase/8/docs/technotes/guides/language/assert.html#enable-disable
.
- I'm not sure the current {{getLastINode()==null}} check is enough for
{{INodeReference}}s. What if the blocks changed in the middle of the snapshots?
For example, say file1 has blocks 1 & 2, and then the following happens:
snapshot s1, truncate so the file has block 1 only, snapshot s2, append so the
file has blocks 1 & 3, snapshot s3. Would {{fsck -includeSnapshots}} be able to
tell the difference now?
- Because locks are reacquired during fsck, it's theoretically possible that
snapshots are created / deleted during the scan. I think the current behavior is
that we're not aware of new snapshots, and we skip the deleted snapshots (since
{{snapshottableDirs}} is populated before the {{check}} call). Would it be
possible to add a fault-injection test to make sure we don't NPE on deleted
snapshots?
- Admittedly, {{NamenodeFsck}} also has other block counts, like
{{numMinReplicatedBlocks}}. The current code only takes care of the total block
count, which IMO is the most important one. That also seems to be the goal of
this jira as suggested by the title and description, so it's okay to split the
rest into another jira.
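To make the first point concrete, here is a minimal sketch (hypothetical names, not the actual patch code) of replacing a {{validate()}} that relies on {{assert}} with an explicit check, so the validation runs even when the JVM is started without {{-ea}}:

```java
// Hypothetical sketch: an explicit check instead of assert-and-catch-AssertionError.
// With `assert`, the check silently disappears unless the JVM runs with -ea;
// an explicit exception is thrown unconditionally.
final class BlockCountCheck {
  static void validateBlockCount(long counted, long expected) {
    if (counted != expected) {
      throw new IllegalStateException(
          "Block count mismatch: counted=" + counted + " expected=" + expected);
    }
  }

  public static void main(String[] args) {
    validateBlockCount(4, 4);        // passes silently
    boolean caught = false;
    try {
      validateBlockCount(12, 4);     // over-counted snapshot blocks
    } catch (IllegalStateException e) {
      caught = true;
    }
    System.out.println(caught);
  }
}
```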
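To illustrate why the truncate/append scenario matters for the total count, here is a hypothetical sketch (made-up block IDs, not the fsck code) of the difference between counting blocks per snapshot path and counting distinct blocks once:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the truncate/append scenario: the same file, seen
// through three snapshots plus the current version, references overlapping
// but not identical block lists. Per-path counting inflates the total;
// counting distinct block IDs does not.
public class SnapshotBlockCount {
  public static void main(String[] args) {
    List<long[]> blocksPerPath = Arrays.asList(
        new long[] {1, 2},   // snapshot s1: blocks 1 & 2
        new long[] {1},      // snapshot s2: truncated, block 1 only
        new long[] {1, 3},   // snapshot s3: appended, blocks 1 & 3
        new long[] {1, 3});  // current file
    long perPath = blocksPerPath.stream().mapToLong(b -> b.length).sum();
    Set<Long> distinct = new HashSet<>();
    for (long[] blocks : blocksPerPath) {
      for (long id : blocks) {
        distinct.add(id);
      }
    }
    System.out.println(perPath);          // 7: per-path counting over-reports
    System.out.println(distinct.size());  // 3: blocks 1, 2, 3 counted once
  }
}
```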
Trivial ones:
- I see the variable name in {{checkDir}} was changed to {{filePath}}, which is
not accurate. I'd prefer to keep the old name {{path}}.
- {{checkFilesInSnapshotOnly}}: I suggest handling {{inode==null}} in its own
block, so we don't have to worry about it on non-{{INodeFile}} code paths.
(FYI, null is not an instanceof anything, so the patch 4 code didn't have to
check. We need to be careful after changing to {{isFile}}, as (correctly)
suggested by Daryn.)
- {{lastSnapshotId = -1}} should use {{Snapshot.NO_SNAPSHOT_ID}} rather than -1.
- {{inodeFile.getFileWithSnapshotFeature().getDiffs()}} can never be null
judging from {{FileWithSnapshotFeature}}, so there is no need for the nullity check.
- Please format the code you changed; there are many spacing inconsistencies
around brackets.
- Tests should have timeouts. It's perhaps better to just use a {{Rule}} on the
class, to safeguard all cases by default with something like 3 minutes.
- The "HEALTHY" check at the beginning of each test case feels unnecessary to me.
- Could use {{GenericTestUtils.waitFor()}} for the waits.
- Optional: {{TestFsck}} is already 2.4k+ lines long. It may be better to create
a new test class specifically for the snapshot block count. In that class the
name of each test would be shorter and more readable.
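The class-wide timeout suggestion can be sketched like this (JUnit 4; the class and test names are hypothetical):

```java
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.Timeout;

// Sketch: a class-wide timeout rule, so every test method in the class fails
// if it runs longer than 3 minutes, without per-method @Test(timeout=...)
// annotations.
public class TestFsckWithSnapshots {
  @Rule
  public Timeout globalTimeout = Timeout.seconds(180);

  @Test
  public void testBlockCountWithSnapshots() throws Exception {
    // actual fsck assertions would go here
  }
}
```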
Also a curiosity question to [~daryn]:
bq. try - lock - finally unlock v.s. lock - try - finally unlock
Understood, and I completely agree with the advice. My curiosity comes from the
fact that in current HDFS, it looks like only {{FSN#writeLockInterruptibly}}
can throw a checked exception. Of course there could also be unchecked
exceptions. Is this general coding advice, or something you have run into in
practice? Care to share the fun details? :)
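For readers following along, the two orderings being compared look like this (a minimal sketch with a plain {{ReentrantLock}}, not HDFS's {{FSNamesystemLock}}):

```java
import java.util.concurrent.locks.ReentrantLock;

// Minimal illustration of the two lock/try orderings discussed above.
public class LockOrdering {
  private static final ReentrantLock LOCK = new ReentrantLock();

  // Recommended: acquire OUTSIDE the try. If lock() itself fails, the
  // finally block never runs, so we never unlock a lock we don't hold.
  static void recommended() {
    LOCK.lock();
    try {
      // critical section
    } finally {
      LOCK.unlock();
    }
  }

  // Risky: if lock() throws before the lock is held, the finally block still
  // runs and unlock() throws IllegalMonitorStateException, masking the
  // original failure.
  static void risky() {
    try {
      LOCK.lock();
      // critical section
    } finally {
      LOCK.unlock();
    }
  }

  public static void main(String[] args) {
    recommended();
    risky(); // both succeed here; the difference only shows when lock() fails
    System.out.println(LOCK.isLocked());
  }
}
```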
> fsck -includeSnapshots reports wrong amount of total blocks
> -----------------------------------------------------------
>
> Key: HDFS-12618
> URL: https://issues.apache.org/jira/browse/HDFS-12618
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: tools
> Affects Versions: 3.0.0-alpha3
> Reporter: Wellington Chevreuil
> Assignee: Wellington Chevreuil
> Priority: Minor
> Attachments: HDFS-121618.initial, HDFS-12618.001.patch,
> HDFS-12618.002.patch, HDFS-12618.003.patch, HDFS-12618.004.patch
>
>
> When snapshots are enabled, if a file is deleted but still referenced by a
> snapshot, *fsck* will not report blocks for such a file, showing a different
> number of *total blocks* than what is exposed in the Web UI.
> This should be fine, as *fsck* provides the *-includeSnapshots* option. The
> problem is that *-includeSnapshots* causes *fsck* to count blocks for
> every occurrence of a file in snapshots, which is wrong because these blocks
> should be counted only once (for instance, if a 100MB file is present in 3
> snapshots, it would still map to only one block in HDFS). This causes fsck to
> report many more blocks than actually exist in HDFS and are reported in
> the Web UI.
> Here's an example:
> 1) HDFS has two files of 2 blocks each:
> {noformat}
> $ hdfs dfs -ls -R /
> drwxr-xr-x - root supergroup 0 2017-10-07 21:21 /snap-test
> -rw-r--r-- 1 root supergroup 209715200 2017-10-07 20:16 /snap-test/file1
> -rw-r--r-- 1 root supergroup 209715200 2017-10-07 20:17 /snap-test/file2
> drwxr-xr-x - root supergroup 0 2017-05-13 13:03 /test
> {noformat}
> 2) There are two snapshots, with the two files present on each of the
> snapshots:
> {noformat}
> $ hdfs dfs -ls -R /snap-test/.snapshot
> drwxr-xr-x - root supergroup 0 2017-10-07 21:21
> /snap-test/.snapshot/snap1
> -rw-r--r-- 1 root supergroup 209715200 2017-10-07 20:16
> /snap-test/.snapshot/snap1/file1
> -rw-r--r-- 1 root supergroup 209715200 2017-10-07 20:17
> /snap-test/.snapshot/snap1/file2
> drwxr-xr-x - root supergroup 0 2017-10-07 21:21
> /snap-test/.snapshot/snap2
> -rw-r--r-- 1 root supergroup 209715200 2017-10-07 20:16
> /snap-test/.snapshot/snap2/file1
> -rw-r--r-- 1 root supergroup 209715200 2017-10-07 20:17
> /snap-test/.snapshot/snap2/file2
> {noformat}
> 3) *fsck -includeSnapshots* reports 12 blocks in total (4 blocks for the
> normal file path, plus 4 blocks for each snapshot path):
> {noformat}
> $ hdfs fsck / -includeSnapshots
> FSCK started by root (auth:SIMPLE) from /127.0.0.1 for path / at Mon Oct 09
> 15:15:36 BST 2017
> Status: HEALTHY
> Number of data-nodes: 1
> Number of racks: 1
> Total dirs: 6
> Total symlinks: 0
> Replicated Blocks:
> Total size: 1258291200 B
> Total files: 6
> Total blocks (validated): 12 (avg. block size 104857600 B)
> Minimally replicated blocks: 12 (100.0 %)
> Over-replicated blocks: 0 (0.0 %)
> Under-replicated blocks: 0 (0.0 %)
> Mis-replicated blocks: 0 (0.0 %)
> Default replication factor: 1
> Average block replication: 1.0
> Missing blocks: 0
> Corrupt blocks: 0
> Missing replicas: 0 (0.0 %)
> {noformat}
> 4) Web UI shows the correct number (4 blocks only):
> {noformat}
> Security is off.
> Safemode is off.
> 5 files and directories, 4 blocks = 9 total filesystem object(s).
> {noformat}
> I would like to work on this, and will propose an initial solution shortly.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)