[ https://issues.apache.org/jira/browse/HDFS-7924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315879#comment-17315879 ]

Tak-Lon (Stephen) Wu commented on HDFS-7924:
--------------------------------------------

To extend my comment above: will the HDFS NameNode re-replicate the blocks of an open file (a small one, with 90 bytes held on the DataNode side) when a DataNode is draining? And if that DataNode holds the only copy of the block and the block is not going to be replicated, is this a data-loss issue?
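
As a way to see which files the NameNode still considers open (and hence which last blocks are under construction), something like the sketch below should work. This is a hedged sketch against the {{DistributedFileSystem}} client API (Hadoop 2.9+/3.x, where {{listOpenFiles}} exists), not something I ran on this cluster; the CLI counterpart is {{hdfs dfsadmin -listOpenFiles}}.

{code:java}
// Sketch: enumerate files the NameNode still considers open for write.
// Assumes fs.defaultFS points at the cluster in question.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.OpenFileEntry;

public class ListOpenFiles {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());
    RemoteIterator<OpenFileEntry> it = dfs.listOpenFiles();
    while (it.hasNext()) {
      OpenFileEntry e = it.next();
      // Each entry names the path plus the lease-holding client.
      System.out.println(e.getFilePath() + " held by " + e.getClientName());
    }
  }
}
{code}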

For example, we have a file that has been opened but not closed, and it's very tiny; see the investigation below. (FYI, HBase uses {{hsync}} as the default.)
 * We found that, from the HDFS NameNode's point of view, the open file has {{0}} bytes (a client-side sketch reproducing this follows the listing).

{code:java}
[hadoop@ip-10-233-6-226 ~]$ hadoop fs -ls /user/hbase/WAL/WALs/
Found 2 items
drwxr-xr-x   - hbase hbase          0 2021-04-06 22:18 /user/hbase/WAL/WALs/ip-10-233-13-240.ec2.internal,16020,1617388713303
drwxr-xr-x   - hbase hbase          0 2021-04-06 21:38 /user/hbase/WAL/WALs/ip-10-233-13-79.ec2.internal,16020,1617388712545
[hadoop@ip-10-233-6-226 ~]$ hadoop fs -ls -R /user/hbase/WAL/WALs/*
-rw-r--r--   2 hbase hbase          0 2021-04-06 21:38 /user/hbase/WAL/WALs/ip-10-233-13-79.ec2.internal,16020,1617388712545/ip-10-233-13-79.ec2.internal%2C16020%2C1617388712545.1617745133347
{code}
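
The {{0}} here is reproducible with the plain client API; below is a minimal sketch (hypothetical path and URI, not from this cluster) showing that after {{hflush}} the data sits on the DataNodes while {{getFileStatus}} still reports the length the NameNode last recorded:

{code:java}
// Sketch: an open, hflushed file reports length 0 from the NameNode.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileLength {
  public static void main(String[] args) throws Exception {
    FileSystem fs =
        FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
    Path p = new Path("/tmp/open-file-demo");
    FSDataOutputStream out = fs.create(p);
    out.write(new byte[90]);  // roughly WAL-header-sized payload
    out.hflush();             // bytes reach the DataNode pipeline...
    // ...but the NameNode has not been told a new length yet:
    System.out.println(fs.getFileStatus(p).getLen()); // prints 0
    out.close();              // close completes the block
    System.out.println(fs.getFileStatus(p).getLen()); // prints 90
  }
}
{code}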
 * We confirmed at the {{fsck}} level that the file {{/user/hbase/WAL/WALs/ip-10-233-13-79.ec2.internal,16020,1617388712545/ip-10-233-13-79.ec2.internal%2C16020%2C1617388712545.1617745133347}} has {{0}} length (a programmatic check follows the output).

{code:java}
[hadoop@ip-10-233-6-226 ~]$ hdfs fsck /user/hbase/WAL/WALs/ip-10-233-13-79.ec2.internal,16020,1617388712545/ip-10-233-13-79.ec2.internal%2C16020%2C1617388712545.1617745133347
Connecting to namenode via http://ip-10-233-6-226.ec2.internal:9870/fsck?ugi=hadoop&path=%2Fuser%2Fhbase%2FWAL%2FWALs%2Fip-10-233-13-79.ec2.internal%2C16020%2C1617388712545%2Fip-10-233-13-79.ec2.internal%252C16020%252C1617388712545.1617745133347
FSCK started by hadoop (auth:SIMPLE) from /10.233.6.226 for path /user/hbase/WAL/WALs/ip-10-233-13-79.ec2.internal,16020,1617388712545/ip-10-233-13-79.ec2.internal%2C16020%2C1617388712545.1617745133347 at Tue Apr 06 22:29:33 UTC 2021

Status: HEALTHY
 Number of data-nodes:    2
 Number of racks:        1
 Total dirs:            0
 Total symlinks:        0

Replicated Blocks:
 Total size:    0 B
 Total files:    0 (Files currently being written: 1)
 Total blocks (validated):    0 (Total open file blocks (not validated): 1)
 Minimally replicated blocks:    0
 Over-replicated blocks:    0
 Under-replicated blocks:    0
 Mis-replicated blocks:        0
 Default replication factor:    2
 Average block replication:    0.0
 Missing blocks:        0
 Corrupt blocks:        0
 Missing replicas:        0

Erasure Coded Block Groups:
 Total size:    0 B
 Total files:    0
 Total block groups (validated):    0
 Minimally erasure-coded block groups:    0
 Over-erasure-coded block groups:    0
 Under-erasure-coded block groups:    0
 Unsatisfactory placement block groups:    0
 Average block group size:    0.0
 Missing block groups:        0
 Corrupt block groups:        0
 Missing internal blocks:    0
FSCK ended at Tue Apr 06 22:29:33 UTC 2021 in 2 milliseconds


The filesystem under path '/user/hbase/WAL/WALs/ip-10-233-13-79.ec2.internal,16020,1617388712545/ip-10-233-13-79.ec2.internal%2C16020%2C1617388712545.1617745133347' is HEALTHY
{code}
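
The "Files currently being written: 1" line can also be confirmed from the client side. A minimal sketch (assuming the WAL path is passed as the first argument) using {{DistributedFileSystem#isFileClosed}}:

{code:java}
// Sketch: ask the NameNode whether the file's last block is complete.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class IsClosedCheck {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());
    // For the WAL above we would expect "closed? false".
    System.out.println("closed? " + dfs.isFileClosed(new Path(args[0])));
  }
}
{code}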
 * If we {{cat}} this open file, we can see its WAL header (a visible-length sketch follows the output).

{code:java}
[hadoop@ip-10-233-6-226 ~]$ hadoop fs -cat /user/hbase/WAL/WALs/ip-10-233-13-240.ec2.internal,16020,1617388713303/ip-10-233-13-240.ec2.internal%2C16020%2C1617388713303.1617747490834
PWAL"ProtobufLogWriter*<org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
{code}
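
The reason the header is readable is that a DataNode serves the in-flight replica up to its acknowledged ("visible") length, independent of the 0-byte length the NameNode reports. A hedged sketch reading that length via {{HdfsDataInputStream}} (path passed as an argument, assuming {{fs.defaultFS}} points at the cluster):

{code:java}
// Sketch: read the DataNode-visible length of an open file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataInputStream;

public class VisibleLength {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (HdfsDataInputStream in =
             (HdfsDataInputStream) fs.open(new Path(args[0]))) {
      // Bytes acknowledged by the pipeline; expect 90 for the WAL above,
      // even while getFileStatus().getLen() still says 0.
      System.out.println("visible length: " + in.getVisibleLength());
    }
  }
}
{code}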
 * Also, if we download it to local disk, it shows up as a non-zero file of 90 bytes (a programmatic equivalent is sketched after the output).

{code:java}
[hadoop@ip-10-233-6-226 ~]$ hadoop fs -copyToLocal /user/hbase/WAL/WALs/ip-10-233-13-240.ec2.internal,16020,1617388713303/ip-10-233-13-240.ec2.internal%2C16020%2C1617388713303.1617747490834 .
[hadoop@ip-10-233-6-226 ~]$ ls -l
-rw-r--r-- 1 hadoop hadoop      90 Apr  6 22:33 ip-10-233-13-240.ec2.internal%2C16020%2C1617388713303.1617747490834
{code}
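
The same download can be done through the API; a small sketch (hypothetical local destination) that should likewise produce a 90-byte local file:

{code:java}
// Sketch: programmatic copyToLocal of the open WAL file.
import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyWal {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path src = new Path(args[0]);          // the open WAL file
    Path dst = new Path("/tmp/wal-copy");  // hypothetical local destination
    fs.copyToLocalFile(src, dst);          // reads up to the visible length
    System.out.println(new File("/tmp/wal-copy").length()); // expect 90
  }
}
{code}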
 * Running {{ls}} again after the download, the file is still showing as zero (see the {{UPDATE_LENGTH}} sketch below for why the listed length lags).

{code:java}
[hadoop@ip-10-233-6-226 ~]$ hadoop fs -ls -R /user/hbase/WAL/WALs/*
-rw-r--r--   2 hbase hbase          0 2021-04-06 22:18 /user/hbase/WAL/WALs/ip-10-233-13-240.ec2.internal,16020,1617388713303/ip-10-233-13-240.ec2.internal%2C16020%2C1617388713303.1617747490834
-rw-r--r--   2 hbase hbase          0 2021-04-06 22:18 /user/hbase/WAL/WALs/ip-10-233-13-240.ec2.internal,16020,1617388713303/ip-10-233-13-240.ec2.internal%2C16020%2C1617388713303.meta.1617747490920.meta
-rw-r--r--   2 hbase hbase          0 2021-04-06 21:38 /user/hbase/WAL/WALs/ip-10-233-13-79.ec2.internal,16020,1617388712545/ip-10-233-13-79.ec2.internal%2C16020%2C1617388712545.1617745133347
{code}
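
This matches the documented {{hsync}} semantics: a plain {{hsync()}} persists the bytes on the DataNodes but does not push a new length to the NameNode; only {{hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH))}} (or closing the file) does, and presumably the WAL writer never passes that flag. A minimal writer-side sketch (hypothetical path, assuming the stream comes from a {{DistributedFileSystem}}):

{code:java}
// Sketch: NameNode-visible length only moves with UPDATE_LENGTH (or close).
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag;

public class UpdateLengthDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/update-length-demo"); // hypothetical path
    try (HdfsDataOutputStream out = (HdfsDataOutputStream) fs.create(p)) {
      out.write(new byte[90]);
      out.hsync();                                      // durable on DNs...
      System.out.println(fs.getFileStatus(p).getLen()); // ...but NN says 0
      out.hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));    // tell the NN the length
      System.out.println(fs.getFileStatus(p).getLen()); // now 90
    }
  }
}
{code}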

> NameNode goes into infinite lease recovery
> ------------------------------------------
>
>                 Key: HDFS-7924
>                 URL: https://issues.apache.org/jira/browse/HDFS-7924
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client, namenode
>    Affects Versions: 2.6.0
>            Reporter: Arpit Agarwal
>            Assignee: Yi Liu
>            Priority: Major
>
> We encountered an HDFS lease recovery issue. All DataNodes+NameNodes were 
> restarted while a client was running. A block was created on the NN but it 
> had not yet been created on DNs. The NN tried to recover the lease for the 
> block on restart but was unable to do so, getting into an infinite loop.


