[jira] [Comment Edited] (HBASE-12419) "Partial cell read caused by EOF" ERRORs on replication source during replication

Andrew Purtell (JIRA) Wed, 05 Nov 2014 16:40:17 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-12419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199447#comment-14199447
 ]


Andrew Purtell edited comment on HBASE-12419 at 11/6/14 12:39 AM:
------------------------------------------------------------------

I built an ingest test (attached) to catch this under debugger. It's a 
relatively rare occurrence. 

{noformat}
Daemon Thread [RS:0;apurtell-ltm1:55971-EventThread.replicationSource,1] 
(Suspended (breakpoint at line 66 in BaseDecoder))     
        
KeyValueCodec$KeyValueDecoder(BaseDecoder).rethrowEofException(IOException) 
line: 66    
        KeyValueCodec$KeyValueDecoder(BaseDecoder).advance() line: 53
          in = org.apache.hadoop.hdfs.client.HdfsDataInputStream
            .in = org.apache.hadoop.hdfs.DFSInputStream
              .blockEnd = 17967301
              .lastBlockBeingWrittenLength = 17967302
              .lastLocatedBlock.b.block.numBytes = 17967302
              .locatedBlocks.blocks.size = 1
              .locatedBlocks.isLastBlockComplete = false
              .pos = 17967302              
        WALEdit.readFromCells(Codec$Decoder, int) line: 248
          cellDecoder = 
org.apache.hadoop.hbase.codec.KeyValueCodec$KeyValueDecoder     
          expectedCount = 1
        ProtobufLogReader.readNext(HLog$Entry) line: 317
          expectedCount = 1
          posBefore = 17967298
        ProtobufLogReader(ReaderBase).next(HLog$Entry) line: 106        
        ProtobufLogReader(ReaderBase).next() line: 91   
        ReplicationHLogReaderManager.readNextAndSetPosition() line: 86  
        ReplicationSource.readAllEntriesToReplicateOrNextFile(boolean, 
List<Entry>) line: 441   
        ReplicationSource.run() line: 328       
{noformat}

This is a short read from an incomplete file. We then seek back to the last 
good known location at ProtobufLineReader.readNext:346

"Encountered a malformed edit, seeking back to last good position in file"

I conclude the reread after seeking back is ultimately successful because the 
ingest test finds every written key. I've also done a manual test between two 
real clusters on EC2 where I write 1M rows with one cell per row only, and at 
the end of the test 1M rows always have arrived at the target cluster even 
though BaseDecoder has logged a few of the ERRORs in question.

Is there more to see here?

Should we downgrade the message at BaseDecoder.rethrowEofException:66 from 
ERROR to TRACE level?



was (Author: apurtell):
I built an ingest test (attached) to catch this under debugger. It's a 
relatively rare occurrence. 

{noformat}
Daemon Thread [RS:0;apurtell-ltm1:55971-EventThread.replicationSource,1] 
(Suspended (breakpoint at line 66 in BaseDecoder))     
        
KeyValueCodec$KeyValueDecoder(BaseDecoder).rethrowEofException(IOException) 
line: 66    
        KeyValueCodec$KeyValueDecoder(BaseDecoder).advance() line: 53
          in = org.apache.hadoop.hdfs.client.HdfsDataInputStream
            .in = org.apache.hadoop.hdfs.DFSInputStream
              .blockEnd = 17967301
              .lastBlockBeingWrittenLength = 17967302
              .lastLocatedBlock.b.block.numBytes = 17967302
              .locatedBlocks.blocks.size = 1
              .locatedBlocks.isLastBlockComplete = false
              .pos = 17967302              
        WALEdit.readFromCells(Codec$Decoder, int) line: 248
          cellDecoder = 
org.apache.hadoop.hbase.codec.KeyValueCodec$KeyValueDecoder     
          expectedCount = 1
        ProtobufLogReader.readNext(HLog$Entry) line: 317
          expectedCount = 1
          posBefore = 17967298
        ProtobufLogReader(ReaderBase).next(HLog$Entry) line: 106        
        ProtobufLogReader(ReaderBase).next() line: 91   
        ReplicationHLogReaderManager.readNextAndSetPosition() line: 86  
        ReplicationSource.readAllEntriesToReplicateOrNextFile(boolean, 
List<Entry>) line: 441   
        ReplicationSource.run() line: 328       
{noformat}

This is a short read from an incomplete file. We then seek back to the last 
good known location at ProtobufLineReader.readNext:346

"Encountered a malformed edit, seeking back to last good position in file"

I conclude the reread after seeking back is ultimately successful because the 
ingest test finds every written key. I've also done a manual test where I write 
1M rows with one cell per row only, and at the end of the test 1M rows always 
have arrived at the target cluster.

Is there more to see here?

Should we downgrade the message at BaseDecoder.rethrowEofException:66 from 
ERROR to TRACE level?


> "Partial cell read caused by EOF" ERRORs on replication source during 
> replication
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-12419
>                 URL: https://issues.apache.org/jira/browse/HBASE-12419
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.7
>            Reporter: Andrew Purtell
>             Fix For: 2.0.0, 0.98.8, 0.99.2
>
>         Attachments: TestReplicationIngest.patch
>
>
> We are seeing exceptions like these on the replication sources when 
> replication is active:
> {noformat}
> 2014-11-04 01:20:19,738 ERROR 
> [regionserver8120-EventThread.replicationSource,1] codec.BaseDecoder:
> Partial cell read caused by EOF: java.io.IOException: Premature EOF from 
> inputStream
> {noformat}
> HBase 0.98.8-SNAPSHOT, Hadoop 2.4.1.
> Happens both with and without short circuit reads on the source cluster.
> I'm able to reproduce this reliably:
> # Set up two clusters. Can be single slave.
> # Enable replication in configuration
> # Use LoadTestTool -init_only on both clusters
> # On source cluster via shell: alter 
> 'cluster_test',{NAME=>'test_cf',REPLICATION_SCOPE=>1}
> # On source cluster via shell: add_peer 'remote:port:/hbase'
> # On source cluster, LoadTestTool -skip_init -write 1:1024:10 -num_keys 
> 1000000
> # Wait for LoadTestTool to complete
> # Use the shell to verify 1M rows are in 'cluster_test' on the target cluster.
> All 1M rows will replicate without data loss, but I'll see 5-15 instances of 
> "Partial cell read caused by EOF" messages logged from codec.BaseDecoder at 
> ERROR level on the replication source. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (HBASE-12419) "Partial cell read caused by EOF" ERRORs on replication source during replication

Reply via email to