[ https://issues.apache.org/jira/browse/HBASE-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450714#comment-13450714 ]
Devaraj Das commented on HBASE-6733:
------------------------------------

For the first problem, the sequence is:
1. The replicator thread in ReplicationSource fails to find anything to replicate for maxRetriesMultiplier consecutive retries. From then on, the thread sleeps for sleepForRetries times the maximum value of sleepMultiplier on every pass. In every iteration of the thread's run method, readAllEntriesToReplicateOrNextFile gets called, and at the end of that method, processEndOfFile gets called.
2. At some point the log roller enqueues a WAL file to replicate.
3. Now when processEndOfFile is called, currentPath is set to null, and the thread's run method gets a new file to replicate (the output of the ReplicationSource.getNextPath() call).
4. But sleepMultiplier is still set to the max value that was reached in (1).
5. If there is an exception while reading the new WAL file (enqueued in (2)), the file is penalized too heavily, since sleepMultiplier is still at the max.

An example is below:
{noformat}
2012-08-31 19:16:19,029 INFO [main] wal.HLog(620): Roll /user/hortonde/hbase/.logs/foo.net,50437,1346440555753/foo.net%2C50437%2C1346440555753.1346440556675, entries=2, filesize=626. for /user/hortonde/hbase/.logs/foo.net,50437,1346440555753/foo.net%2C50437%2C1346440555753.1346440579013
2012-08-31 19:16:19,032 DEBUG [main] wal.SequenceFileLogWriter(126): using new createWriter -- HADOOP-6840
2012-08-31 19:16:19,032 DEBUG [main] wal.SequenceFileLogWriter(136): Path=hdfs://localhost:34512/user/hortonde/hbase/.logs/foo.net,44638,1346440555781/foo.net%2C44638%2C1346440555781.1346440579029, syncFs=true, hflush=false
2012-08-31 19:16:19,033 DEBUG [RegionServer:0;foo.net,50437,1346440555753.replicationSource,2] regionserver.ReplicationSource(474): Opening log for replication foo.net%2C50437%2C1346440555753.1346440556675 at 626
2012-08-31 19:16:19,036 INFO [RegionServer:0;foo.net,50437,1346440555753.replicationSource,2] wal.SequenceFileLogReader(217): hdfs://localhost:34512/user/hortonde/hbase/.logs/foo.net,50437,1346440555753/foo.net%2C50437%2C1346440555753.1346440556675, entryStart=626, pos=626, end=626, edit=0
2012-08-31 19:16:19,036 DEBUG [RegionServer:0;foo.net,50437,1346440555753.replicationSource,2] regionserver.ReplicationSource(429): currentNbOperations:0 and seenEntries:0 and size: 0
2012-08-31 19:16:19,036 DEBUG [RegionServer:0;foo.net,50437,1346440555753.replicationSource,2] regionserver.ReplicationSource(474): Opening log for replication foo.net%2C50437%2C1346440555753.1346440579013 at 0
2012-08-31 19:16:19,037 WARN [RegionServer:0;foo.net,50437,1346440555753.replicationSource,2] regionserver.ReplicationSource(530): 2 Got: java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at java.io.DataInputStream.readFully(DataInputStream.java:152)
	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:58)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:166)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:686)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:478)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:289)
2012-08-31 19:16:19,038 WARN [RegionServer:0;foo.net,50437,1346440555753.replicationSource,2] regionserver.ReplicationSource(534): Waited too long for this file, considering dumping
{noformat}

> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-2]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6733
>                 URL: https://issues.apache.org/jira/browse/HBASE-6733
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>             Fix For: 0.92.3
>
>
> The failure is in TestReplication.queueFailover (fails due to unreplicated
> rows). I have come across two problems:
> 1. The sleepMultiplier is not properly reset when the currentPath is changed
> (in ReplicationSource.java).
> 2. ReplicationExecutor sometimes removes files to replicate from the queue too
> early, resulting in the corresponding edits going missing. Here the problem is
> that the log-file length that the replication executor finds is not the most
> recent one, and hence it doesn't read anything from there, and ultimately,
> when there is a log roll, the replication-queue gets a new entry, and the
> executor drops the old entry out of the queue.
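
The first problem (a stale sleepMultiplier carried over to a newly dequeued WAL file) can be illustrated with a small standalone model. This is only a sketch under assumptions: the class SleepMultiplierSketch, its methods, and the resetOnNewPath flag are hypothetical names invented for illustration and are not the real ReplicationSource code; only sleepMultiplier, sleepForRetries, maxRetriesMultiplier and currentPath come from the comment above. Resetting the multiplier when the path changes is one plausible remedy, not necessarily the committed patch.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical, simplified model of the retry back-off; NOT the real ReplicationSource. */
class SleepMultiplierSketch {
    private final long sleepForRetries;      // base sleep in milliseconds
    private final int maxRetriesMultiplier;  // cap for sleepMultiplier
    private final boolean resetOnNewPath;    // hypothetical switch: false models the reported bug
    private final Queue<String> queue = new ArrayDeque<>();
    private String currentPath;              // WAL file currently being replicated, or null
    private int sleepMultiplier = 1;

    public SleepMultiplierSketch(long sleepForRetries, int maxRetriesMultiplier,
                                 boolean resetOnNewPath) {
        this.sleepForRetries = sleepForRetries;
        this.maxRetriesMultiplier = maxRetriesMultiplier;
        this.resetOnNewPath = resetOnNewPath;
    }

    /** Step (2): the log roller enqueues a new WAL file. */
    public void enqueue(String path) {
        queue.add(path);
    }

    /** Step (1): nothing to replicate; returns the sleep time and backs off up to the cap. */
    public long idleIteration() {
        long sleepMillis = sleepForRetries * sleepMultiplier;
        if (sleepMultiplier < maxRetriesMultiplier) {
            sleepMultiplier++;
        }
        return sleepMillis;
    }

    /** Step (3): drop the current file and take the next one from the queue, if any. */
    public boolean switchToNextPath() {
        currentPath = queue.poll();
        if (currentPath == null) {
            return false;
        }
        if (resetOnNewPath) {
            sleepMultiplier = 1; // a new file starts with a fresh back-off
        }
        return true;
    }

    public int getSleepMultiplier() {
        return sleepMultiplier;
    }

    public String getCurrentPath() {
        return currentPath;
    }

    public static void main(String[] args) {
        for (boolean reset : new boolean[] {false, true}) {
            SleepMultiplierSketch source = new SleepMultiplierSketch(1000L, 10, reset);
            for (int i = 0; i < 20; i++) {
                source.idleIteration();
            }
            source.enqueue("new-wal-file"); // placeholder name, not a real WAL path
            source.switchToNextPath();
            System.out.println("resetOnNewPath=" + reset
                + " sleepMultiplier=" + source.getSleepMultiplier());
        }
    }
}
```

With resetOnNewPath set to false, the new file inherits the capped multiplier, so a single read failure on it (such as the EOFException in the log above) is treated as if the file had already been retried many times. With the flag set to true, the new file starts from a multiplier of 1.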