[ https://issues.apache.org/jira/browse/HBASE-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15112009#comment-15112009 ]
Hadoop QA commented on HBASE-15019:
-----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | {color:red} docker {color} | {color:red} 18m 23s {color} | {color:red} Docker failed to build yetus/hbase:date2016-01-22. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12783597/HBASE-15019-v4.patch |
| JIRA Issue | HBASE-15019 |
| Powered by | Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/250/console |

This message was automatically generated.

> Replication stuck when HDFS is restarted
> ----------------------------------------
>
> Key: HBASE-15019
> URL: https://issues.apache.org/jira/browse/HBASE-15019
> Project: HBase
> Issue Type: Bug
> Components: Replication, wal
> Affects Versions: 2.0.0, 1.2.0, 1.1.2, 1.0.3, 0.98.16.1
> Reporter: Matteo Bertozzi
> Assignee: Matteo Bertozzi
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: HBASE-15019-v0_branch-1.2.patch, HBASE-15019-v1.patch,
> HBASE-15019-v1_0.98.patch, HBASE-15019-v1_branch-1.2.patch,
> HBASE-15019-v2.patch, HBASE-15019-v3.patch, HBASE-15019-v4.patch
>
>
> The RS is working normally and writing to the WAL.
> HDFS is killed and restarted, and the RS tries to do a roll.
> The close fails, but the roll succeeds (because HDFS is now up) and everything works.
> {noformat}
> 2015-12-11 21:52:28,058 ERROR org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter: Got IOException while writing trailer
> java.io.IOException: All datanodes 10.51.30.152:50010 are bad. Aborting...
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1147)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
> 2015-12-11 21:52:28,059 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Failed close of HLog writer
> java.io.IOException: All datanodes 10.51.30.152:50010 are bad. Aborting...
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1147)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
> 2015-12-11 21:52:28,059 WARN org.apache.hadoop.hbase.regionserver.wal.FSHLog: Riding over HLog close failure! error count=1
> {noformat}
> The problem is on the replication side: the log that we rolled and were not able to close is waiting for a lease recovery.
> {noformat}
> 2015-12-11 21:16:31,909 ERROR org.apache.hadoop.hbase.regionserver.wal.HLogFactory: Can't open after 267 attempts and 301124ms
> {noformat}
> The WALFactory notifies us about that, but there is nothing on the RS side that performs the WAL recovery.
> {noformat}
> 2015-12-11 21:11:30,921 WARN org.apache.hadoop.hbase.regionserver.wal.HLogFactory: Lease should have recovered. This is not expected.
> Will retry
> java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1547065147-10.51.30.152-1446756937665:blk_1073801614_61243; getBlockSize()=83; corrupt=false; offset=0; locs=[10.51.30.154:50010, 10.51.30.152:50010, 10.51.30.155:50010]}
>     at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:358)
>     at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:300)
>     at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:237)
>     at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:230)
>     at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1448)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:301)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:297)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:297)
>     at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:161)
>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:116)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:89)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:77)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:68)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:508)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:321)
> {noformat}
> The only way to trigger a WAL recovery is to restart and force the master to trigger the lease recovery on WAL split.
> But there is a case where restarting will not help.
> If the RS keeps going, rolling and flushing, the unclosed WAL will be moved to the archive, and at that point the master will never try to do a lease recovery on it.
> Since we know that the RS is still going, should we try to recover the lease on the RS side?
> Is it better/safer to trigger an abort on the RS, so that only the master does lease recovery?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)