[
https://issues.apache.org/jira/browse/HBASE-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dave Latham updated HBASE-8096:
-------------------------------
Attachment: HBASE-8096.0.94.patch
I was able to reproduce this problem regularly with a few steps (a rough sketch of the tweaks follows the list):
- Revert HBASE-7530 (i.e. put back
conf1.setInt("hbase.regionserver.hlog.blocksize", 1024*20))
- Introduce a Thread.sleep(500) at the beginning of DataXceiver.writeBlock
- Run TestReplicationSmallTests.loadTesting (more likely to occur if you
change NB_ROWS_IN_BIG_BATCH to something larger)
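For concreteness, here is a minimal, self-contained sketch of the first tweak, assuming a 0.94-era Hadoop/HBase classpath. The class name is invented for this example; in the real reproduction the conf1.setInt call goes back into the replication test setup, and the Thread.sleep(500) is hand-edited into HDFS's DataXceiver.writeBlock.
{noformat}
import org.apache.hadoop.conf.Configuration;

// Illustrative sketch only (class name invented for this example).
public class SmallWalBlockSizeSketch {
  public static void main(String[] args) {
    Configuration conf1 = new Configuration();
    // 20 KB WAL blocks make the log allocate new HDFS blocks frequently,
    // which is exactly the window the replication reader races against.
    conf1.setInt("hbase.regionserver.hlog.blocksize", 1024 * 20);
    System.out.println(conf1.getInt("hbase.regionserver.hlog.blocksize", -1));
    // The second tweak, a Thread.sleep(500) at the top of
    // DataXceiver.writeBlock, widens that window on the DataNode side.
  }
}
{noformat}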
Watching the test output brought some more insight. In production we only saw
this happening on region servers when they were reusing an existing reader (on
a new data block). In the test I could also see the NPE logged when a new
reader was being used, but there it was caught and wrapped in an IOException by
HLog.getReader, so the existing logic in ReplicationSource.openReader and
ReplicationSource.run retried it.
I'm attaching a patch for 0.94 which solves the problem for the test case I
described. It updates ReplicationHLogReaderManager.openReader to catch the NPE
and wrap it in an IOException so that both cases react the same way. Then, when
an IOException is caught in ReplicationSource.openReader, it checks whether the
cause is an NPE and, if so, allows ReplicationSource.run to retry the file.
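To make the intent concrete for reviewers, here is a minimal, self-contained sketch of that catch-wrap-and-check-cause flow. The class and method shapes below are invented for illustration; the actual patch changes ReplicationHLogReaderManager.openReader and ReplicationSource.openReader.
{noformat}
import java.io.IOException;

// Illustrative only: mirrors the shape of the fix, not the patch itself.
public class NpeRetrySketch {

  // Stand-in for ReplicationHLogReaderManager.openReader: convert the NPE
  // coming out of DFSInputStream into an IOException so the reused-reader
  // path fails the same way as the new-reader path already does.
  static void openReader(boolean blockStillAllocating) throws IOException {
    try {
      if (blockStillAllocating) {
        // Simulates DFSClient$DFSInputStream.updateBlockInfo throwing while
        // the WAL is still acquiring its new block.
        throw new NullPointerException();
      }
    } catch (NullPointerException npe) {
      throw new IOException("Failed to open WAL reader", npe);
    }
  }

  // Stand-in for the check in ReplicationSource.openReader: an IOException
  // caused by an NPE is treated as transient, so run() retries the same file.
  static boolean isRetryable(IOException ioe) {
    return ioe.getCause() instanceof NullPointerException;
  }

  public static void main(String[] args) {
    try {
      openReader(true);
    } catch (IOException ioe) {
      System.out.println("retry this log: " + isRetryable(ioe)); // prints true
    }
  }
}
{noformat}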
It would be great if one or two people could give it a look, especially
[~jdcryans] if you have a moment.
I'll look at a patch for 0.95 and trunk.
> [replication] NPE while replicating a log that is acquiring a new block from
> HDFS
> ----------------------------------------------------------------------------------
>
> Key: HBASE-8096
> URL: https://issues.apache.org/jira/browse/HBASE-8096
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.94.5
> Reporter: Ian Friedman
> Attachments: HBASE-8096.0.94.patch
>
>
> We're getting an NPE during replication, which causes replication for that
> RegionServer to stop until we restart it.
> {noformat}
> 2013-03-10 12:49:12,679 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unexpected exception in ReplicationSource, currentPath=hdfs://hmaster1:9000/hbase/.logs/hslave1177,60020,1362549511446/hslave1177%2C60020%2C1362549511446.1362944946489
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.updateBlockInfo(DFSClient.java:1882)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1855)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1831)
>     at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:578)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
>     at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:108)
>     at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1495)
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.openFile(SequenceFileLogReader.java:62)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1482)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:308)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:69)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:505)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
> {noformat}
> Some extra digging into the DataNode and NameNode logs suggests this is
> related to HBASE-7530 and HDFS-4380.
> Here are the relevant snipped portions of the RS, DN, and NN logs:
> {noformat}
> RS 2013-03-10 12:49:12,618 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #hslave1177%2C60020%2C1362549511446.1362944946489 for position 59670826 in hdfs://hmaster1:9000/hbase/.logs/hslave1177,60020,1362549511446/hslave1177%2C60020%2C1362549511446.1362944946489
> RS 2013-03-10 12:49:12,621 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: HLog roll requested
> RS 2013-03-10 12:49:12,623 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicated in total: 31500300
> RS 2013-03-10 12:49:12,623 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication hslave1177%2C60020%2C1362549511446.1362944946489 at 59670826
> NN 2013-03-10 12:49:12,627 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/.logs/hslave1177,60020,1362549511446/hslave1177%2C60020%2C1362549511446.1362944946489. blk_6905758215335505153_656717631
> RS 2013-03-10 12:49:12,679 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unexpected exception in ReplicationSource, currentPath=hdfs://hmaster1:9000/hbase/.logs/hslave1177,60020,1362549511446/hslave1177%2C60020%2C1362549511446.1362944946489
> DN 2013-03-10 12:49:12,680 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_6905758215335505153_656717631 src: /192.168.44.1:43503 dest: /192.168.44.1:50010
> NN 2013-03-10 12:49:12,804 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.fsync: file /hbase/.logs/hslave1177,60020,1362549511446/hslave1177%2C60020%2C1362549511446.1362944946489 for DFSClient_hb_rs_hslave1177,60020,1362549511446
> {noformat}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira