[ 
https://issues.apache.org/jira/browse/HBASE-8615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated HBASE-8615:
--------------------------------------

         Priority: Critical  (was: Major)
    Fix Version/s: 0.95.2
                   0.98.0
          Summary: HLog Compression fails in mysterious ways (working title)  
(was: TestReplicationQueueFailoverCompressed#queueFailover fails on hadoop 2.0 
due to IndexOutOfBoundsException)

Here's what I know about the different problems.

The first one is that we find data in the compressed HLog that's unexpected. It 
happens easily on Hadoop 2 and takes more data to hit on Hadoop 1. It manifests 
itself as show in the jira's description or like this:

{noformat}
2013-07-27 15:17:54,789 ERROR [RS:1;vesta:34230.replicationSource,2] 
wal.ProtobufLogReader(236): Error  while reading 4 WAL KVs; started reading at 
65475 and read up to 65541
2013-07-27 15:17:54,790 WARN  [RS:1;vesta:34230.replicationSource,2] 
regionserver.ReplicationSource(323): 2 Got: 
java.io.IOException: Error  while reading 4 WAL KVs; started reading at 65475 
and read up to 65541
        at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:237)
        at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:96)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.readNextAndSetPosition(ReplicationHLogReaderManager.java:89)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:407)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:319)
Caused by: java.lang.IllegalArgumentException
        at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:76)
        at 
org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$StreamUtils.toShort(WALCellCodec.java:353)
        at 
org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$CompressedKvDecoder.readIntoArray(WALCellCodec.java:237)
        at 
org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$CompressedKvDecoder.parseCell(WALCellCodec.java:206)
        at 
org.apache.hadoop.hbase.codec.BaseDecoder.advance(BaseDecoder.java:46)
        at 
org.apache.hadoop.hbase.regionserver.wal.WALEdit.readFromCells(WALEdit.java:213)
        at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:217)
        ... 4 more
{noformat}

One thing I saw is that it always happens when we're close to a multiple of 
Short.MAX_VALUE. In the stack trace I just pasted you can see it started 
reading at 65475 and in the jira's description it was ending at 65538.

I'm able to recreate the problem with at patch to 
TestReplicationHLogReaderManager that I'm going to attach later. I also was 
able to recreate the problem on a single node cluster and was able to grab a 
"corrupted" HLog that will also be attached.

The other problem I found is that when appending WALEdits with only 1 KV to a 
compressed HLog, it hits an invalid PB:

{noformat}
2013-07-31 11:38:52,156 ERROR [main] wal.ProtobufLogReader(199):
Invalid PB while reading WAL, probably an unexpected EOF, ignoring
com.google.protobuf.InvalidProtocolBufferException: Protocol message
contained an invalid tag (zero).
        at 
com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:68)
        at 
com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
        at 
org.apache.hadoop.hbase.protobuf.generated.WALProtos$WALKey$Builder.mergeFrom(WALProtos.java:1120)
        at 
org.apache.hadoop.hbase.protobuf.generated.WALProtos$WALKey$Builder.mergeFrom(WALProtos.java:885)
        at 
com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:212)
        at 
com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:746)
        at 
com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:238)
        at 
com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:282)
        at 
com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:760)
        at 
com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:288)
        at 
com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:752)
        at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:197)
        at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:96)
{noformat}

Printing the position when it fails I can see it's still around a multiple of 
Short.MAX_VALUE, and using the unit test I attached you can reliably get the 
issue after reading the same number of edits. I wasn't able to trigger the 
issue in Hadoop 1 unfortunately, but it seems related.
                
> HLog Compression fails in mysterious ways (working title)
> ---------------------------------------------------------
>
>                 Key: HBASE-8615
>                 URL: https://issues.apache.org/jira/browse/HBASE-8615
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ted Yu
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.98.0, 0.95.2
>
>         Attachments: 
> org.apache.hadoop.hbase.replication.TestReplicationQueueFailoverCompressed-output.txt
>
>
> In a recent test run, I noticed the following in test output:
> {code}
> 2013-05-24 22:01:02,424 DEBUG 
> [RegionServer:0;kiyo.gq1.ygridcore.net,42690,1369432806911.replicationSource,2]
>  fs.HFileSystem$ReorderWALBlocks(327): 
> /user/hortonzy/hbase/.logs/kiyo.gq1.ygridcore.net,42690,1369432806911/kiyo.gq1.ygridcore.net%2C42690%2C1369432806911.1369432840428
>  is an HLog file, so reordering blocks, last hostname will 
> be:kiyo.gq1.ygridcore.net
> 2013-05-24 22:01:02,429 DEBUG 
> [RegionServer:0;kiyo.gq1.ygridcore.net,42690,1369432806911.replicationSource,2]
>  wal.ProtobufLogReader(118): After reading the trailer: walEditsStopOffset: 
> 132235, fileLength: 132243, trailerPresent: true
> 2013-05-24 22:01:02,438 ERROR 
> [RegionServer:0;kiyo.gq1.ygridcore.net,42690,1369432806911.replicationSource,2]
>  wal.ProtobufLogReader(236): Error  while reading 691 WAL KVs; started 
> reading at 53272 and read up to 65538
> 2013-05-24 22:01:02,438 WARN  
> [RegionServer:0;kiyo.gq1.ygridcore.net,42690,1369432806911.replicationSource,2]
>  regionserver.ReplicationSource(324): 2 Got:
> java.io.IOException: Error  while reading 691 WAL KVs; started reading at 
> 53272 and read up to 65538
>         at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:237)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:96)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.readNextAndSetPosition(ReplicationHLogReaderManager.java:89)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:404)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:320)
> Caused by: java.lang.IndexOutOfBoundsException: index (30062) must be less 
> than size (1)
>         at 
> com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:305)
>         at 
> com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:284)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.LRUDictionary$BidirectionalLRUMap.get(LRUDictionary.java:124)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.LRUDictionary$BidirectionalLRUMap.access$000(LRUDictionary.java:71)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.LRUDictionary.getEntry(LRUDictionary.java:42)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$CompressedKvDecoder.readIntoArray(WALCellCodec.java:210)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$CompressedKvDecoder.parseCell(WALCellCodec.java:184)
>         at 
> org.apache.hadoop.hbase.codec.BaseDecoder.advance(BaseDecoder.java:46)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.WALEdit.readFromCells(WALEdit.java:213)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:217)
>         ... 4 more
> 2013-05-24 22:01:02,439 DEBUG 
> [RegionServer:0;kiyo.gq1.ygridcore.net,42690,1369432806911.replicationSource,2]
>  regionserver.ReplicationSource(583): Nothing to replicate, sleeping 100 
> times 10
> {code}
> Will attach test output.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to