[
https://issues.apache.org/jira/browse/HBASE-12419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199447#comment-14199447
]
Andrew Purtell edited comment on HBASE-12419 at 11/6/14 12:39 AM:
------------------------------------------------------------------
I built an ingest test (attached) to catch this under debugger. It's a
relatively rare occurrence.
{noformat}
Daemon Thread [RS:0;apurtell-ltm1:55971-EventThread.replicationSource,1]
(Suspended (breakpoint at line 66 in BaseDecoder))
KeyValueCodec$KeyValueDecoder(BaseDecoder).rethrowEofException(IOException)
line: 66
KeyValueCodec$KeyValueDecoder(BaseDecoder).advance() line: 53
in = org.apache.hadoop.hdfs.client.HdfsDataInputStream
.in = org.apache.hadoop.hdfs.DFSInputStream
.blockEnd = 17967301
.lastBlockBeingWrittenLength = 17967302
.lastLocatedBlock.b.block.numBytes = 17967302
.locatedBlocks.blocks.size = 1
.locatedBlocks.isLastBlockComplete = false
.pos = 17967302
WALEdit.readFromCells(Codec$Decoder, int) line: 248
cellDecoder =
org.apache.hadoop.hbase.codec.KeyValueCodec$KeyValueDecoder
expectedCount = 1
ProtobufLogReader.readNext(HLog$Entry) line: 317
expectedCount = 1
posBefore = 17967298
ProtobufLogReader(ReaderBase).next(HLog$Entry) line: 106
ProtobufLogReader(ReaderBase).next() line: 91
ReplicationHLogReaderManager.readNextAndSetPosition() line: 86
ReplicationSource.readAllEntriesToReplicateOrNextFile(boolean,
List<Entry>) line: 441
ReplicationSource.run() line: 328
{noformat}
This is a short read from an incomplete file. We then seek back to the last
good known location at ProtobufLineReader.readNext:346
"Encountered a malformed edit, seeking back to last good position in file"
I conclude the reread after seeking back is ultimately successful because the
ingest test finds every written key. I've also done a manual test between two
real clusters on EC2 where I write 1M rows with one cell per row only, and at
the end of the test 1M rows always have arrived at the target cluster even
though BaseDecoder has logged a few of the ERRORs in question.
Is there more to see here?
Should we downgrade the message at BaseDecoder.rethrowEofException:66 from
ERROR to TRACE level?
was (Author: apurtell):
I built an ingest test (attached) to catch this under debugger. It's a
relatively rare occurrence.
{noformat}
Daemon Thread [RS:0;apurtell-ltm1:55971-EventThread.replicationSource,1]
(Suspended (breakpoint at line 66 in BaseDecoder))
KeyValueCodec$KeyValueDecoder(BaseDecoder).rethrowEofException(IOException)
line: 66
KeyValueCodec$KeyValueDecoder(BaseDecoder).advance() line: 53
in = org.apache.hadoop.hdfs.client.HdfsDataInputStream
.in = org.apache.hadoop.hdfs.DFSInputStream
.blockEnd = 17967301
.lastBlockBeingWrittenLength = 17967302
.lastLocatedBlock.b.block.numBytes = 17967302
.locatedBlocks.blocks.size = 1
.locatedBlocks.isLastBlockComplete = false
.pos = 17967302
WALEdit.readFromCells(Codec$Decoder, int) line: 248
cellDecoder =
org.apache.hadoop.hbase.codec.KeyValueCodec$KeyValueDecoder
expectedCount = 1
ProtobufLogReader.readNext(HLog$Entry) line: 317
expectedCount = 1
posBefore = 17967298
ProtobufLogReader(ReaderBase).next(HLog$Entry) line: 106
ProtobufLogReader(ReaderBase).next() line: 91
ReplicationHLogReaderManager.readNextAndSetPosition() line: 86
ReplicationSource.readAllEntriesToReplicateOrNextFile(boolean,
List<Entry>) line: 441
ReplicationSource.run() line: 328
{noformat}
This is a short read from an incomplete file. We then seek back to the last
good known location at ProtobufLineReader.readNext:346
"Encountered a malformed edit, seeking back to last good position in file"
I conclude the reread after seeking back is ultimately successful because the
ingest test finds every written key. I've also done a manual test where I write
1M rows with one cell per row only, and at the end of the test 1M rows always
have arrived at the target cluster.
Is there more to see here?
Should we downgrade the message at BaseDecoder.rethrowEofException:66 from
ERROR to TRACE level?
> "Partial cell read caused by EOF" ERRORs on replication source during
> replication
> ---------------------------------------------------------------------------------
>
> Key: HBASE-12419
> URL: https://issues.apache.org/jira/browse/HBASE-12419
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.98.7
> Reporter: Andrew Purtell
> Fix For: 2.0.0, 0.98.8, 0.99.2
>
> Attachments: TestReplicationIngest.patch
>
>
> We are seeing exceptions like these on the replication sources when
> replication is active:
> {noformat}
> 2014-11-04 01:20:19,738 ERROR
> [regionserver8120-EventThread.replicationSource,1] codec.BaseDecoder:
> Partial cell read caused by EOF: java.io.IOException: Premature EOF from
> inputStream
> {noformat}
> HBase 0.98.8-SNAPSHOT, Hadoop 2.4.1.
> Happens both with and without short circuit reads on the source cluster.
> I'm able to reproduce this reliably:
> # Set up two clusters. Can be single slave.
> # Enable replication in configuration
> # Use LoadTestTool -init_only on both clusters
> # On source cluster via shell: alter
> 'cluster_test',{NAME=>'test_cf',REPLICATION_SCOPE=>1}
> # On source cluster via shell: add_peer 'remote:port:/hbase'
> # On source cluster, LoadTestTool -skip_init -write 1:1024:10 -num_keys
> 1000000
> # Wait for LoadTestTool to complete
> # Use the shell to verify 1M rows are in 'cluster_test' on the target cluster.
> All 1M rows will replicate without data loss, but I'll see 5-15 instances of
> "Partial cell read caused by EOF" messages logged from codec.BaseDecoder at
> ERROR level on the replication source.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)