[ https://issues.apache.org/jira/browse/HBASE-12419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199447#comment-14199447 ]
Andrew Purtell edited comment on HBASE-12419 at 11/6/14 12:27 AM: ------------------------------------------------------------------ I built an ingest test (attached) to catch this under debugger. It's a relatively rare occurrence. {noformat} Daemon Thread [RS:0;apurtell-ltm1:55971-EventThread.replicationSource,1] (Suspended (breakpoint at line 66 in BaseDecoder)) KeyValueCodec$KeyValueDecoder(BaseDecoder).rethrowEofException(IOException) line: 66 KeyValueCodec$KeyValueDecoder(BaseDecoder).advance() line: 53 in = org.apache.hadoop.hdfs.client.HdfsDataInputStream .in = org.apache.hadoop.hdfs.DFSInputStream .blockEnd = 17967301 .lastBlockBeingWrittenLength = 17967302 .lastLocatedBlock.b.block.numBytes = 17967302 .locatedBlocks.blocks.size = 1 .locatedBlocks.isLastBlockComplete = false .pos = 17967302 WALEdit.readFromCells(Codec$Decoder, int) line: 248 cellDecoder = org.apache.hadoop.hbase.codec.KeyValueCodec$KeyValueDecoder expectedCount = 1 ProtobufLogReader.readNext(HLog$Entry) line: 317 expectedCount = 1 posBefore = 17967298 ProtobufLogReader(ReaderBase).next(HLog$Entry) line: 106 ProtobufLogReader(ReaderBase).next() line: 91 ReplicationHLogReaderManager.readNextAndSetPosition() line: 86 ReplicationSource.readAllEntriesToReplicateOrNextFile(boolean, List<Entry>) line: 441 ReplicationSource.run() line: 328 {noformat} This is a short read from an incomplete file. We then seek back to the last good known location at ProtobufLineReader.readNext:346 "Encountered a malformed edit, seeking back to last good position in file" I conclude the reread after seeking back is ultimately successful because I am replicating rows with only one cell per row and row counts are correct at the end of testing. Is there more to see here? Should we downgrade the message at BaseDecoder.rethrowEofException:66 from ERROR to TRACE level? was (Author: apurtell): I built an ingest test (attached) to catch this under debugger. It's a relatively rare occurrence. {noformat} Daemon Thread [RS:0;apurtell-ltm1:55971-EventThread.replicationSource,1] (Suspended (breakpoint at line 66 in BaseDecoder)) KeyValueCodec$KeyValueDecoder(BaseDecoder).rethrowEofException(IOException) line: 66 KeyValueCodec$KeyValueDecoder(BaseDecoder).advance() line: 53 in = org.apache.hadoop.hdfs.client.HdfsDataInputStream .in = org.apache.hadoop.hdfs.DFSInputStream .blockEnd = 17967301 .lastBlockBeingWrittenLength = 17967302 .lastLocatedBlock.b.block.numBytes = 17967302 .locatedBlocks.blocks.size = 1 .locatedBlocks.isLastBlockComplete = false .pos = 17967302 WALEdit.readFromCells(Codec$Decoder, int) line: 248 cellDecoder = org.apache.hadoop.hbase.codec.KeyValueCodec$KeyValueDecoder expectedCount = 1 ProtobufLogReader.readNext(HLog$Entry) line: 317 expectedCount = 1 posBefore = 17967298 ProtobufLogReader(ReaderBase).next(HLog$Entry) line: 106 ProtobufLogReader(ReaderBase).next() line: 91 ReplicationHLogReaderManager.readNextAndSetPosition() line: 86 ReplicationSource.readAllEntriesToReplicateOrNextFile(boolean, List<Entry>) line: 441 ReplicationSource.run() line: 328 {noformat} This is a short read from an incomplete last block. We then seek back to the last good known location at ProtobufLineReader.readNext:346 "Encountered a malformed edit, seeking back to last good position in file" I conclude the reread after seeking back is ultimately successful because I am replicating rows with only one cell per row and row counts are correct at the end of testing. Is there more to see here? Should we downgrade the message at BaseDecoder.rethrowEofException:66 from ERROR to TRACE level? > "Partial cell read caused by EOF" ERRORs on replication source during > replication > --------------------------------------------------------------------------------- > > Key: HBASE-12419 > URL: https://issues.apache.org/jira/browse/HBASE-12419 > Project: HBase > Issue Type: Bug > Affects Versions: 0.98.7 > Reporter: Andrew Purtell > Fix For: 2.0.0, 0.98.8, 0.99.2 > > Attachments: TestReplicationIngest.patch > > > We are seeing exceptions like these on the replication sources when > replication is active: > {noformat} > 2014-11-04 01:20:19,738 ERROR > [regionserver8120-EventThread.replicationSource,1] codec.BaseDecoder: > Partial cell read caused by EOF: java.io.IOException: Premature EOF from > inputStream > {noformat} > HBase 0.98.8-SNAPSHOT, Hadoop 2.4.1. > Happens both with and without short circuit reads on the source cluster. > I'm able to reproduce this reliably: > # Set up two clusters. Can be single slave. > # Enable replication in configuration > # Use LoadTestTool -init_only on both clusters > # On source cluster via shell: alter > 'cluster_test',{NAME=>'test_cf',REPLICATION_SCOPE=>1} > # On source cluster via shell: add_peer 'remote:port:/hbase' > # On source cluster, LoadTestTool -skip_init -write 1:1024:10 -num_keys > 1000000 > # Wait for LoadTestTool to complete > # Use the shell to verify 1M rows are in 'cluster_test' on the target cluster. > All 1M rows will replicate without data loss, but I'll see 5-15 instances of > "Partial cell read caused by EOF" messages logged from codec.BaseDecoder at > ERROR level on the replication source. -- This message was sent by Atlassian JIRA (v6.3.4#6332)