Zephyr Guo created HBASE-20157:
----------------------------------

             Summary: WAL file might get broken
                 Key: HBASE-20157
                 URL: https://issues.apache.org/jira/browse/HBASE-20157
             Project: HBase
          Issue Type: Bug
          Components: wal
    Affects Versions: 1.1.0
            Reporter: Zephyr Guo
            Assignee: Zephyr Guo
             Fix For: 2.0.0


A WAL file can be corrupted because of HBASE-16824.
When Writer.close() and Writer.sync() are called at the same time, an HDFS
bug (HDFS-13243) is triggered. When that happens, the last block of the WAL
is broken (the NN marks it as a corrupt block).

I am reporting this scenario to help others who run into the same problem.
(HBASE-16824 has already been fixed, though.)
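
The race described above can be illustrated with a minimal Java sketch that
serializes close() and sync() behind one lock, so that a sync can never overlap
a close. This is only an illustration under stated assumptions: it is not the
change made for HBASE-16824 and not the HDFS-13243 fix, and GuardedWriter and
its nested Writer interface are hypothetical names, not HBase or HDFS classes.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical wrapper (not an HBase class) that never runs sync() and close() concurrently. */
class GuardedWriter implements Closeable {

  /** Hypothetical stand-in for a WAL writer: something that can be synced and closed. */
  interface Writer extends Closeable {
    void sync() throws IOException;
  }

  private final Writer delegate;
  private final ReentrantLock lock = new ReentrantLock();
  private boolean closed = false; // guarded by lock

  GuardedWriter(Writer delegate) {
    this.delegate = delegate;
  }

  /** Syncs the underlying writer; returns false (and does nothing) once close() has run. */
  boolean sync() throws IOException {
    lock.lock();
    try {
      if (closed) {
        return false;
      }
      delegate.sync();
      return true;
    } finally {
      lock.unlock();
    }
  }

  /** Closes the underlying writer at most once, and never while a sync is in progress. */
  @Override
  public void close() throws IOException {
    lock.lock();
    try {
      if (closed) {
        return;
      }
      closed = true;
      delegate.close();
    } finally {
      lock.unlock();
    }
  }
}
```

A single lock is the simplest way to make the two calls mutually exclusive. The
cost is that a slow sync delays the close, which a real WAL implementation
might not accept.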


{panel:title=RS log}


2018-02-05 07:58:54,212 INFO [regionserver/hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218:16020.logRoller] hdfs.DFSClient: Could not complete /hbase/WALs/hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com,16020,1517453470107/hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com%2C16020%2C1517453470107.default.1517788719683 retrying...
2018-02-05 07:59:00,612 INFO [regionserver/hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218:16020.logRoller] hdfs.DFSClient: Could not complete /hbase/WALs/hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com,16020,1517453470107/hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com%2C16020%2C1517453470107.default.1517788719683 retrying...
{panel}
{panel:title=NN log}


2018-02-05 07:58:48,011 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /hbase/WALs/hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com,16020,1517453470107/hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com%2C16020%2C1517453470107.default.1517788719683 for DFSClient_NONMAPREDUCE_1109936977_1
2018-02-05 07:58:48,011 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1080650145_6909339\{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-a4e579e7-4721-4c22-9b61-f1d00b33c45f:NORMAL:10.0.0.218:50010|RBW], ReplicaUC[[DISK]DS-5d3d7878-876d-4a5a-97bc-5535c4cf8d59:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-ccc314b2-e2ad-4c1f-99a5-a39e3677a83b:NORMAL:10.0.0.221:50010|RBW]]} is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in file /hbase/WALs/hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com,16020,1517453470107/hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com%2C16020%2C1517453470107.default.1517788719683
2018-02-05 07:58:48,111 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1080650145 added as corrupt on 10.0.0.221:50010 by hb-j5e517al6xib80rkb-005.hbase.rds.aliyuncs.com/10.0.0.221 because block is COMMITTED and reported length 1957330 does not match length in block map 80594
2018-02-05 07:58:48,224 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1080650145 added as corrupt on 10.0.0.218:50010 by hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is COMMITTED and reported length 1957330 does not match length in block map 80594
2018-02-05 07:58:48,224 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1080650145 added as corrupt on 10.0.0.220:50010 by hb-j5e517al6xib80rkb-003.hbase.rds.aliyuncs.com/10.0.0.220 because block is COMMITTED and reported length 1957330 does not match length in block map 80594
2018-02-05 07:58:48,511 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1080650145_6909339\{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-a4e579e7-4721-4c22-9b61-f1d00b33c45f:NORMAL:10.0.0.218:50010|RBW], ReplicaUC[[DISK]DS-5d3d7878-876d-4a5a-97bc-5535c4cf8d59:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-ccc314b2-e2ad-4c1f-99a5-a39e3677a83b:NORMAL:10.0.0.221:50010|RBW]]} is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in file /hbase/WALs/hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com,16020,1517453470107/hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com%2C16020%2C1517453470107.default.1517788719683
{panel}
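
For anyone checking whether they hit the same problem, the telltale NN entries
are the addToCorruptReplicasMap lines in which the reported replica length
differs from the length in the block map. The following Java sketch shows one
way to pick those entries out of a NameNode log. The class name
CorruptReplicaLine and its fields are hypothetical, and the pattern is derived
only from the entries quoted above, so other HDFS versions may word the message
differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for one "added as corrupt" NameNode log entry. */
class CorruptReplicaLine {
  final String blockId;
  final String dataNode;
  final long reportedLength;
  final long blockMapLength;

  // Whitespace-tolerant, so an entry wrapped across several lines still matches.
  private static final Pattern ENTRY = Pattern.compile(
      "addToCorruptReplicasMap:\\s+(blk_\\d+)\\s+added\\s+as\\s+corrupt\\s+on\\s+(\\S+)\\s+by\\s+"
          + ".*?reported\\s+length\\s+(\\d+)\\s+does\\s+not\\s+match\\s+length\\s+in\\s+block\\s+map\\s+(\\d+)",
      Pattern.DOTALL);

  private CorruptReplicaLine(String blockId, String dataNode, long reported, long inBlockMap) {
    this.blockId = blockId;
    this.dataNode = dataNode;
    this.reportedLength = reported;
    this.blockMapLength = inBlockMap;
  }

  /** Returns the parsed entry, or empty if the text is not a length-mismatch entry. */
  static Optional<CorruptReplicaLine> parse(String entry) {
    Matcher m = ENTRY.matcher(entry);
    if (!m.find()) {
      return Optional.empty();
    }
    return Optional.of(new CorruptReplicaLine(
        m.group(1), m.group(2), Long.parseLong(m.group(3)), Long.parseLong(m.group(4))));
  }
}
```

Applied to the entries in the NN log above, each of the three replicas of
blk_1080650145 would be reported with length 1957330 against a block map length
of 80594.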



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
