[
https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458121#comment-13458121
]
Jean-Daniel Cryans commented on HBASE-6649:
-------------------------------------------
We applied this patch on a cluster that replicates, and almost all the nodes
stopped replicating after some time. This is what I see in the logs:
{noformat}
2012-09-17 20:04:08,111 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication va1r3s24%2C10304%2C1347911704238.1347911706318 at 78617132
2012-09-17 20:04:08,120 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Break on IOE: hdfs://va1r5s41:10101/va1-backup/.logs/va1r3s24,10304,1347911704238/va1r3s24%2C10304%2C1347911704238.1347911706318, entryStart=78641557, pos=78771200, end=78771200, edit=84
2012-09-17 20:04:08,120 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:164529 and seenEntries:84 and size: 154068
2012-09-17 20:04:08,120 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicating 84
2012-09-17 20:04:08,146 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #va1r3s24%2C10304%2C1347911704238.1347911706318 for position 78771200 in hdfs://va1r5s41:10101/va1-backup/.logs/va1r3s24,10304,1347911704238/va1r3s24%2C10304%2C1347911704238.1347911706318
2012-09-17 20:04:08,158 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Removing 0 logs in the list: []
2012-09-17 20:04:08,158 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicated in total: 93234
2012-09-17 20:04:08,158 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication va1r3s24%2C10304%2C1347911704238.1347911706318 at 78771200
2012-09-17 20:04:08,163 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unexpected exception in ReplicationSource, currentPath=hdfs://va1r5s41:10101/va1-backup/.logs/va1r3s24,10304,1347911704238/va1r3s24%2C10304%2C1347911704238.1347911706318
java.lang.IndexOutOfBoundsException
    at java.io.DataInputStream.readFully(DataInputStream.java:175)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2001)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1901)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1947)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:235)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:394)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:307)
{noformat}
The file is still in HDFS and is about double the size of the position
reported above (78771200), so that position wasn't the end of the file. On the
other nodes, "Break on IOE" always appears just before the exception that kills
replication, which is why I think this patch is the cause. Somehow, reading up
to the reported end is reading too far.
We need to fix or backport.
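
For illustration only, here is a minimal sketch of one way a reader of
length-prefixed records can hit an IndexOutOfBoundsException in
DataInputStream.readFully: if it resumes at an offset that is not a record
boundary, the 4 bytes it interprets as a length may decode to a negative
number, and readFully rejects a negative length. Whether that is what happens
with this patch is an assumption, not something confirmed here. The class
ReadFullyDemo and its methods are hypothetical; this is plain JDK code, not the
SequenceFile or HBase reader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical demo class; it only mimics a length-prefixed record format.
class ReadFullyDemo {

    // Writes one record: a 4-byte big-endian length followed by the payload.
    static byte[] frame(byte[] payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(payload.length);
        out.write(payload);
        out.flush();
        return bytes.toByteArray();
    }

    // Reads one record starting at the given offset, trusting whatever
    // 4 bytes it finds there as the record length.
    static byte[] readRecord(byte[] data, int offset) throws IOException {
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(data, offset, data.length - offset));
        int length = in.readInt();
        byte[] payload = new byte[Math.max(length, 0)];
        // readFully throws IndexOutOfBoundsException when length is negative.
        in.readFully(payload, 0, length);
        return payload;
    }

    public static void main(String[] args) throws IOException {
        // Payload bytes chosen so that the first four decode to -2 as an int.
        byte[] data = frame(new byte[] {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFE, 7});
        System.out.println("aligned read, payload bytes: " + readRecord(data, 0).length);
        try {
            readRecord(data, 4); // misaligned: the "length" comes from payload bytes
        } catch (IndexOutOfBoundsException e) {
            System.out.println("misaligned read failed: " + e);
        }
    }
}
```

If something like this is in play, the position saved after "Break on IOE"
would have to land somewhere other than a record boundary for the next open to
fail this way.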
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
> Key: HBASE-6649
> URL: https://issues.apache.org/jira/browse/HBASE-6649
> Project: HBase
> Issue Type: Bug
> Reporter: Devaraj Das
> Assignee: Devaraj Das
> Fix For: 0.96.0, 0.92.3, 0.94.2
>
> Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt,
> 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test -
> queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover
> [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB &
> http://bit.ly/O79Dq7 ..
> Looking briefly at the logs hints at a pattern - in both the failed test
> instances, there was an RS crash while the test was running.