[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Jean-Daniel Cryans (JIRA) Tue, 18 Sep 2012 13:29:09 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458121#comment-13458121
 ]


Jean-Daniel Cryans commented on HBASE-6649:
-------------------------------------------

We applied this patch on a cluster that replicates, and nearly all the nodes 
stopped replicating after some time. This is what I see in the logs:

{noformat}
2012-09-17 20:04:08,111 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log 
for replication va1r3s24%2C10304%2C1347911704238.1347911706318 at 78617132
2012-09-17 20:04:08,120 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Break on 
IOE: 
hdfs://va1r5s41:10101/va1-backup/.logs/va1r3s24,10304,1347911704238/va1r3s24%2C10304%2C1347911704238.1347911706318,
 entryStart=78641557, pos=78771200, end=78771200, edit=84
2012-09-17 20:04:08,120 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 
currentNbOperations:164529 and seenEntries:84 and size: 154068
2012-09-17 20:04:08,120 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicating 
84
2012-09-17 20:04:08,146 INFO 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
Going to report log #va1r3s24%2C10304%2C1347911704238.1347911706318 for 
position 78771200 in 
hdfs://va1r5s41:10101/va1-backup/.logs/va1r3s24,10304,1347911704238/va1r3s24%2C10304%2C1347911704238.1347911706318
2012-09-17 20:04:08,158 INFO 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
Removing 0 logs in the list: []
2012-09-17 20:04:08,158 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicated 
in total: 93234
2012-09-17 20:04:08,158 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log 
for replication va1r3s24%2C10304%2C1347911704238.1347911706318 at 78771200
2012-09-17 20:04:08,163 ERROR 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unexpected 
exception in ReplicationSource, 
currentPath=hdfs://va1r5s41:10101/va1-backup/.logs/va1r3s24,10304,1347911704238/va1r3s24%2C10304%2C1347911704238.1347911706318
java.lang.IndexOutOfBoundsException
        at java.io.DataInputStream.readFully(DataInputStream.java:175)
        at 
org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at 
org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2001)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1901)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1947)
        at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:235)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:394)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:307)
{noformat}

The file is still in HDFS and it's about double the offset we see up there 
(end=78771200), so that wasn't the end of the file. Looking at other nodes, we 
always get "Break on IOE" before the exception that kills replication. This is 
why I think this patch is the issue. Somehow reading up to the end is reading 
too far.
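
To check other nodes for the same pattern, a small script along these lines 
could scan the region server logs. It is only a sketch: the function name and 
regexes are made up for illustration, the log lines are assumed to be unwrapped 
onto single lines, and treating "reopened past entryStart" as suspicious is a 
heuristic based on the excerpt above, not a confirmed diagnosis.

```python
import re

# Three lines copied from the log excerpt above, unwrapped onto single lines.
LOG_LINES = [
    "Opening log for replication "
    "va1r3s24%2C10304%2C1347911704238.1347911706318 at 78617132",
    "Break on IOE: hdfs://va1r5s41:10101/va1-backup/.logs/"
    "va1r3s24,10304,1347911704238/"
    "va1r3s24%2C10304%2C1347911704238.1347911706318, "
    "entryStart=78641557, pos=78771200, end=78771200, edit=84",
    "Opening log for replication "
    "va1r3s24%2C10304%2C1347911704238.1347911706318 at 78771200",
]

# Hypothetical patterns matching the message formats seen in the excerpt.
BREAK_RE = re.compile(
    r"Break on IOE: .*entryStart=(\d+), pos=(\d+), end=(\d+), edit=(\d+)"
)
OPEN_RE = re.compile(r"Opening log for replication \S+ at (\d+)")


def find_suspect_reopens(lines):
    """Return (entry_start, reopen_pos) pairs where a log is reopened at a
    position beyond the entryStart of the preceding "Break on IOE" line."""
    suspects = []
    last_entry_start = None
    for line in lines:
        m = BREAK_RE.search(line)
        if m:
            last_entry_start = int(m.group(1))
            continue
        m = OPEN_RE.search(line)
        if m and last_entry_start is not None:
            reopen_pos = int(m.group(1))
            if reopen_pos > last_entry_start:
                suspects.append((last_entry_start, reopen_pos))
            # Only compare against the first reopen after each break.
            last_entry_start = None
    return suspects


if __name__ == "__main__":
    for entry_start, reopen_pos in find_suspect_reopens(LOG_LINES):
        print(
            f"reopened at {reopen_pos}, "
            f"{reopen_pos - entry_start} bytes past entryStart={entry_start}"
        )
```

On the excerpt above this flags the reopen at 78771200, which is past 
entryStart=78641557 and is the read that hits the IndexOutOfBoundsException.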

We need to fix or backport.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 
> 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - 
> queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover 
> [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & 
> http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test 
> instances, there was an RS crash while the test was running.

