[jira] [Comment Edited] (HBASE-12419) "Partial cell read caused by EOF" ERRORs on replication source during replication

Andrew Purtell (JIRA) Wed, 05 Nov 2014 16:27:56 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-12419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199447#comment-14199447
 ]


Andrew Purtell edited comment on HBASE-12419 at 11/6/14 12:27 AM:
------------------------------------------------------------------

I built an ingest test (attached) to catch this under debugger. It's a 
relatively rare occurrence. 

{noformat}
Daemon Thread [RS:0;apurtell-ltm1:55971-EventThread.replicationSource,1] 
(Suspended (breakpoint at line 66 in BaseDecoder))     
        
KeyValueCodec$KeyValueDecoder(BaseDecoder).rethrowEofException(IOException) 
line: 66    
        KeyValueCodec$KeyValueDecoder(BaseDecoder).advance() line: 53
          in = org.apache.hadoop.hdfs.client.HdfsDataInputStream
            .in = org.apache.hadoop.hdfs.DFSInputStream
              .blockEnd = 17967301
              .lastBlockBeingWrittenLength = 17967302
              .lastLocatedBlock.b.block.numBytes = 17967302
              .locatedBlocks.blocks.size = 1
              .locatedBlocks.isLastBlockComplete = false
              .pos = 17967302              
        WALEdit.readFromCells(Codec$Decoder, int) line: 248
          cellDecoder = 
org.apache.hadoop.hbase.codec.KeyValueCodec$KeyValueDecoder     
          expectedCount = 1
        ProtobufLogReader.readNext(HLog$Entry) line: 317
          expectedCount = 1
          posBefore = 17967298
        ProtobufLogReader(ReaderBase).next(HLog$Entry) line: 106        
        ProtobufLogReader(ReaderBase).next() line: 91   
        ReplicationHLogReaderManager.readNextAndSetPosition() line: 86  
        ReplicationSource.readAllEntriesToReplicateOrNextFile(boolean, 
List<Entry>) line: 441   
        ReplicationSource.run() line: 328       
{noformat}

This is a short read from an incomplete file. We then seek back to the last 
good known location at ProtobufLineReader.readNext:346

"Encountered a malformed edit, seeking back to last good position in file"

I conclude the reread after seeking back is ultimately successful because I am 
replicating rows with only one cell per row and row counts are correct at the 
end of testing.

Is there more to see here?

Should we downgrade the message at BaseDecoder.rethrowEofException:66 from 
ERROR to TRACE level?



was (Author: apurtell):
I built an ingest test (attached) to catch this under debugger. It's a 
relatively rare occurrence. 

{noformat}
Daemon Thread [RS:0;apurtell-ltm1:55971-EventThread.replicationSource,1] 
(Suspended (breakpoint at line 66 in BaseDecoder))     
        
KeyValueCodec$KeyValueDecoder(BaseDecoder).rethrowEofException(IOException) 
line: 66    
        KeyValueCodec$KeyValueDecoder(BaseDecoder).advance() line: 53
          in = org.apache.hadoop.hdfs.client.HdfsDataInputStream
            .in = org.apache.hadoop.hdfs.DFSInputStream
              .blockEnd = 17967301
              .lastBlockBeingWrittenLength = 17967302
              .lastLocatedBlock.b.block.numBytes = 17967302
              .locatedBlocks.blocks.size = 1
              .locatedBlocks.isLastBlockComplete = false
              .pos = 17967302              
        WALEdit.readFromCells(Codec$Decoder, int) line: 248
          cellDecoder = 
org.apache.hadoop.hbase.codec.KeyValueCodec$KeyValueDecoder     
          expectedCount = 1
        ProtobufLogReader.readNext(HLog$Entry) line: 317
          expectedCount = 1
          posBefore = 17967298
        ProtobufLogReader(ReaderBase).next(HLog$Entry) line: 106        
        ProtobufLogReader(ReaderBase).next() line: 91   
        ReplicationHLogReaderManager.readNextAndSetPosition() line: 86  
        ReplicationSource.readAllEntriesToReplicateOrNextFile(boolean, 
List<Entry>) line: 441   
        ReplicationSource.run() line: 328       
{noformat}

This is a short read from an incomplete last block. We then seek back to the 
last good known location at ProtobufLineReader.readNext:346

"Encountered a malformed edit, seeking back to last good position in file"

I conclude the reread after seeking back is ultimately successful because I am 
replicating rows with only one cell per row and row counts are correct at the 
end of testing.

Is there more to see here?

Should we downgrade the message at BaseDecoder.rethrowEofException:66 from 
ERROR to TRACE level?


> "Partial cell read caused by EOF" ERRORs on replication source during 
> replication
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-12419
>                 URL: https://issues.apache.org/jira/browse/HBASE-12419
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.7
>            Reporter: Andrew Purtell
>             Fix For: 2.0.0, 0.98.8, 0.99.2
>
>         Attachments: TestReplicationIngest.patch
>
>
> We are seeing exceptions like these on the replication sources when 
> replication is active:
> {noformat}
> 2014-11-04 01:20:19,738 ERROR 
> [regionserver8120-EventThread.replicationSource,1] codec.BaseDecoder:
> Partial cell read caused by EOF: java.io.IOException: Premature EOF from 
> inputStream
> {noformat}
> HBase 0.98.8-SNAPSHOT, Hadoop 2.4.1.
> Happens both with and without short circuit reads on the source cluster.
> I'm able to reproduce this reliably:
> # Set up two clusters. Can be single slave.
> # Enable replication in configuration
> # Use LoadTestTool -init_only on both clusters
> # On source cluster via shell: alter 
> 'cluster_test',{NAME=>'test_cf',REPLICATION_SCOPE=>1}
> # On source cluster via shell: add_peer 'remote:port:/hbase'
> # On source cluster, LoadTestTool -skip_init -write 1:1024:10 -num_keys 
> 1000000
> # Wait for LoadTestTool to complete
> # Use the shell to verify 1M rows are in 'cluster_test' on the target cluster.
> All 1M rows will replicate without data loss, but I'll see 5-15 instances of 
> "Partial cell read caused by EOF" messages logged from codec.BaseDecoder at 
> ERROR level on the replication source. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (HBASE-12419) "Partial cell read caused by EOF" ERRORs on replication source during replication

Reply via email to