[
https://issues.apache.org/jira/browse/HDFS-7924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315879#comment-17315879
]
Tak-Lon (Stephen) Wu commented on HDFS-7924:
--------------------------------------------
To extend my comment above: will the HDFS NameNode re-replicate the blocks of an
open file (a small one, with only 90 bytes held in DataNode memory) when a
DataNode is draining? And if that DataNode holds the only copy of the block and
the block is not going to be replicated, is this a data-loss issue?
e.g. we have a file that has been opened but never closed, and it's very tiny;
see the investigation below. (FYI, HBase uses {{hsync}} by default.)
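As a side note on the draining question: on newer releases (Hadoop 2.9/3.x, per HDFS-10480) the NameNode can enumerate files with active leases directly, which is one way to confirm which open files are in play before decommissioning a DataNode. A sketch, assuming a recent enough release; exact flags and output vary by version:
{code:java}
# List files currently open for write, along with the client holding the lease
# (available since Hadoop 2.9/3.0; earlier releases lack this subcommand)
hdfs dfsadmin -listOpenFiles

# On 3.x, this can be narrowed to files that are blocking decommission
hdfs dfsadmin -listOpenFiles -blockingDecommission
{code}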
* we found that, from the HDFS NameNode's point of view, the open file has {{0}} bytes.
{code:java}
[hadoop@ip-10-233-6-226 ~]$ hadoop fs -ls /user/hbase/WAL/WALs/
Found 2 items
drwxr-xr-x - hbase hbase 0 2021-04-06 22:18
/user/hbase/WAL/WALs/ip-10-233-13-240.ec2.internal,16020,1617388713303
drwxr-xr-x - hbase hbase 0 2021-04-06 21:38
/user/hbase/WAL/WALs/ip-10-233-13-79.ec2.internal,16020,1617388712545
[hadoop@ip-10-233-6-226 ~]$ hadoop fs -ls -R /user/hbase/WAL/WALs/*
-rw-r--r-- 2 hbase hbase 0 2021-04-06 21:38
/user/hbase/WAL/WALs/ip-10-233-13-79.ec2.internal,16020,1617388712545/ip-10-233-13-79.ec2.internal%2C16020%2C1617388712545.1617745133347
{code}
* we confirmed at the {{fsck}} level that the file
{{/user/hbase/WAL/WALs/ip-10-233-13-79.ec2.internal,16020,1617388712545/ip-10-233-13-79.ec2.internal%2C16020%2C1617388712545.1617745133347}}
has {{0}} length.
{code:java}
[hadoop@ip-10-233-6-226 ~]$ hdfs fsck
/user/hbase/WAL/WALs/ip-10-233-13-79.ec2.internal,16020,1617388712545/ip-10-233-13-79.ec2.internal%2C16020%2C1617388712545.1617745133347
Connecting to namenode via
http://ip-10-233-6-226.ec2.internal:9870/fsck?ugi=hadoop&path=%2Fuser%2Fhbase%2FWAL%2FWALs%2Fip-10-233-13-79.ec2.internal%2C16020%2C1617388712545%2Fip-10-233-13-79.ec2.internal%252C16020%252C1617388712545.1617745133347
FSCK started by hadoop (auth:SIMPLE) from /10.233.6.226 for path
/user/hbase/WAL/WALs/ip-10-233-13-79.ec2.internal,16020,1617388712545/ip-10-233-13-79.ec2.internal%2C16020%2C1617388712545.1617745133347
at Tue Apr 06 22:29:33 UTC 2021
Status: HEALTHY
Number of data-nodes: 2
Number of racks: 1
Total dirs: 0
Total symlinks: 0
Replicated Blocks:
Total size: 0 B
Total files: 0 (Files currently being written: 1)
Total blocks (validated): 0 (Total open file blocks (not validated): 1)
Minimally replicated blocks: 0
Over-replicated blocks: 0
Under-replicated blocks: 0
Mis-replicated blocks: 0
Default replication factor: 2
Average block replication: 0.0
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 0
Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
FSCK ended at Tue Apr 06 22:29:33 UTC 2021 in 2 milliseconds
The filesystem under path
'/user/hbase/WAL/WALs/ip-10-233-13-79.ec2.internal,16020,1617388712545/ip-10-233-13-79.ec2.internal%2C16020%2C1617388712545.1617745133347'
is HEALTHY
{code}
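Note that by default {{fsck}} does not validate files open for write, which is why the output above counts the block as "not validated". A sketch of re-running it so the under-construction block and its DataNode locations are shown (these are standard {{fsck}} flags; output format varies by version):
{code:java}
# -openforwrite includes files open for write in the check;
# -files -blocks -locations shows which DataNodes hold the
# under-construction block of the open WAL
hdfs fsck /user/hbase/WAL/WALs -openforwrite -files -blocks -locations
{code}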
* if we {{cat}} this open file, we can see its WAL header.
{code:java}
[hadoop@ip-10-233-6-226 ~]$ hadoop fs -cat
/user/hbase/WAL/WALs/ip-10-233-13-240.ec2.internal,16020,1617388713303/ip-10-233-13-240.ec2.internal%2C16020%2C1617388713303.1617747490834
PWAL"ProtobufLogWriter*<org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
{code}
* also, if we download it to local disk, it shows up as a non-zero file of 90
bytes.
{code:java}
[hadoop@ip-10-233-6-226 ~]$ hadoop fs -copyToLocal
/user/hbase/WAL/WALs/ip-10-233-13-240.ec2.internal,16020,1617388713303/ip-10-233-13-240.ec2.internal%2C16020%2C1617388713303.1617747490834
.
[hadoop@ip-10-233-6-226 ~]$ ls -l
-rw-r--r-- 1 hadoop hadoop 90 Apr 6 22:33
ip-10-233-13-240.ec2.internal%2C16020%2C1617388713303.1617747490834
{code}
* running {{ls}} again after the download, the file still shows as zero length.
{code:java}
[hadoop@ip-10-233-6-226 ~]$ hadoop fs -ls -R /user/hbase/WAL/WALs/*
-rw-r--r-- 2 hbase hbase 0 2021-04-06 22:18
/user/hbase/WAL/WALs/ip-10-233-13-240.ec2.internal,16020,1617388713303/ip-10-233-13-240.ec2.internal%2C16020%2C1617388713303.1617747490834
-rw-r--r-- 2 hbase hbase 0 2021-04-06 22:18
/user/hbase/WAL/WALs/ip-10-233-13-240.ec2.internal,16020,1617388713303/ip-10-233-13-240.ec2.internal%2C16020%2C1617388713303.meta.1617747490920.meta
-rw-r--r-- 2 hbase hbase 0 2021-04-06 21:38
/user/hbase/WAL/WALs/ip-10-233-13-79.ec2.internal,16020,1617388712545/ip-10-233-13-79.ec2.internal%2C16020%2C1617388712545.1617745133347
{code}
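For completeness: when the writer is gone, lease recovery can also be forced from the CLI, after which the NameNode-reported length should catch up to the bytes persisted on the DataNodes. A sketch using the open WAL path from above (the retry count is arbitrary; {{hdfs debug recoverLease}} exists in stock Hadoop 2.7+):
{code:java}
# Force lease recovery on the open file so its length is finalized
# at the NameNode; retries are attempted one second apart
hdfs debug recoverLease -path /user/hbase/WAL/WALs/ip-10-233-13-79.ec2.internal,16020,1617388712545/ip-10-233-13-79.ec2.internal%2C16020%2C1617388712545.1617745133347 -retries 3
{code}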
> NameNode goes into infinite lease recovery
> ------------------------------------------
>
> Key: HDFS-7924
> URL: https://issues.apache.org/jira/browse/HDFS-7924
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client, namenode
> Affects Versions: 2.6.0
> Reporter: Arpit Agarwal
> Assignee: Yi Liu
> Priority: Major
>
> We encountered an HDFS lease recovery issue. All DataNodes+NameNodes were
> restarted while a client was running. A block was created on the NN but it
> had not yet been created on DNs. The NN tried to recover the lease for the
> block on restart but was unable to do so getting into an infinite loop.