[jira] [Commented] (HBASE-6733) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-2]

Devaraj Das (JIRA) Fri, 07 Sep 2012 08:26:10 -0700

    [ https://issues.apache.org/jira/browse/HBASE-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450714#comment-13450714 ]


Devaraj Das commented on HBASE-6733:
------------------------------------

For the first problem, the sequence is:
1. The replicator thread in ReplicationSource fails to find anything to 
replicate for maxRetriesMultiplier attempts, so sleepMultiplier reaches its 
max value. From then on the thread repeatedly sleeps for sleepForRetries 
times that max value. In every iteration of the thread's run method, 
readAllEntriesToReplicateOrNextFile gets called, and at the end of that 
method, processEndOfFile gets called.

2. At some point the log roller enqueues a WAL file to replicate.

3. Now when processEndOfFile is called, the currentPath is set to null, and the 
thread's run method gets a new file to replicate (the output of 
ReplicationSource.getNextPath() call). 

4. But the sleepMultiplier is still set to the max value that was set in (1).

5. If an exception occurs while reading the new WAL file (enqueued in (2)), 
the file is penalized far too heavily, since the sleepMultiplier is still set 
to the max. An example is below:

{noformat}
2012-08-31 19:16:19,029 INFO  [main] wal.HLog(620): Roll 
/user/hortonde/hbase/.logs/foo.net,50437,1346440555753/foo.net%2C50437%2C1346440555753.1346440556675,
 entries=2, filesize=626.  for 
/user/hortonde/hbase/.logs/foo.net,50437,1346440555753/foo.net%2C50437%2C1346440555753.1346440579013
2012-08-31 19:16:19,032 DEBUG [main] wal.SequenceFileLogWriter(126): using new 
createWriter -- HADOOP-6840
2012-08-31 19:16:19,032 DEBUG [main] wal.SequenceFileLogWriter(136): 
Path=hdfs://localhost:34512/user/hortonde/hbase/.logs/foo.net,44638,1346440555781/foo.net%2C44638%2C1346440555781.1346440579029,
 syncFs=true, hflush=false
2012-08-31 19:16:19,033 DEBUG 
[RegionServer:0;foo.net,50437,1346440555753.replicationSource,2] 
regionserver.ReplicationSource(474): Opening log for replication 
foo.net%2C50437%2C1346440555753.1346440556675 at 626
2012-08-31 19:16:19,036 INFO  
[RegionServer:0;foo.net,50437,1346440555753.replicationSource,2] 
wal.SequenceFileLogReader(217): 
hdfs://localhost:34512/user/hortonde/hbase/.logs/foo.net,50437,1346440555753/foo.net%2C50437%2C1346440555753.1346440556675,
 entryStart=626, pos=626, end=626, edit=0
2012-08-31 19:16:19,036 DEBUG 
[RegionServer:0;foo.net,50437,1346440555753.replicationSource,2] 
regionserver.ReplicationSource(429): currentNbOperations:0 and seenEntries:0 
and size: 0
2012-08-31 19:16:19,036 DEBUG 
[RegionServer:0;foo.net,50437,1346440555753.replicationSource,2] 
regionserver.ReplicationSource(474): Opening log for replication 
foo.net%2C50437%2C1346440555753.1346440579013 at 0
2012-08-31 19:16:19,037 WARN  
[RegionServer:0;foo.net,50437,1346440555753.replicationSource,2] 
regionserver.ReplicationSource(530): 2 Got:
java.io.EOFException                
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:58)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:166)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:686)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:478)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:289)
2012-08-31 19:16:19,038 WARN  
[RegionServer:0;foo.net,50437,1346440555753.replicationSource,2] 
regionserver.ReplicationSource(534): Waited too long for this file, considering 
dumping
{noformat} 
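
The sequence in steps 1 to 5 can be sketched with a small, simplified model. 
This is not the actual ReplicationSource code: only sleepForRetries, 
sleepMultiplier, maxRetriesMultiplier and currentPath are names from the 
comment above. The class, its methods, the resetOnNewPath flag, and the rule 
"a failure at the max multiplier counts as waited too long" are assumptions 
made for illustration, based on the WARN lines in the log.

```java
// Hypothetical, simplified model of the retry bookkeeping described above.
// It does not read any files; it only tracks the multiplier.
final class SleepMultiplierSketch {
    final long sleepForRetries;       // base sleep in ms
    final int maxRetriesMultiplier;   // cap for sleepMultiplier
    final boolean resetOnNewPath;     // assumed fix: reset when currentPath changes
    int sleepMultiplier = 1;
    String currentPath;

    SleepMultiplierSketch(long sleepForRetries, int maxRetriesMultiplier,
                          boolean resetOnNewPath) {
        this.sleepForRetries = sleepForRetries;
        this.maxRetriesMultiplier = maxRetriesMultiplier;
        this.resetOnNewPath = resetOnNewPath;
    }

    /** One pass with nothing to replicate; returns the sleep time in ms. */
    long idleIteration() {
        long slept = sleepForRetries * sleepMultiplier;
        if (sleepMultiplier < maxRetriesMultiplier) {
            sleepMultiplier++;
        }
        return slept;
    }

    /** The thread picks up a newly enqueued WAL file (steps 2 and 3). */
    void switchTo(String newPath) {
        currentPath = newPath;
        if (resetOnNewPath) {
            sleepMultiplier = 1;  // without this, step 4 applies
        }
    }

    /** Would the first read failure on currentPath already look like a timeout? */
    boolean failureLooksLikeTimeout() {
        return sleepMultiplier >= maxRetriesMultiplier;
    }

    public static void main(String[] args) {
        for (boolean reset : new boolean[] {false, true}) {
            SleepMultiplierSketch s = new SleepMultiplierSketch(1000L, 10, reset);
            for (int i = 0; i < 20; i++) {
                s.idleIteration();
            }
            s.switchTo("new-wal-file");
            System.out.println("resetOnNewPath=" + reset
                + " multiplier=" + s.sleepMultiplier
                + " firstFailureLooksLikeTimeout=" + s.failureLooksLikeTimeout());
        }
    }
}
```

With resetOnNewPath set to false, the new file inherits the max multiplier 
from the idle period, so its very first read failure is treated as if the 
thread had already waited on that file for a long time. With the reset, the 
new file starts from a multiplier of 1.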
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-2]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6733
>                 URL: https://issues.apache.org/jira/browse/HBASE-6733
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>             Fix For: 0.92.3
>
>
> The failure is in TestReplication.queueFailover (fails due to unreplicated 
> rows). I have come across two problems:
> 1. The sleepMultiplier is not properly reset when the currentPath is changed 
> (in ReplicationSource.java).
> 2. ReplicationExecutor sometimes removes files to replicate from the queue 
> too early, so the corresponding edits go missing. The problem is that the 
> log-file length the replication executor finds is not the most up-to-date 
> one, so it doesn't read anything from the file. Ultimately, when there is 
> a log roll, the replication queue gets a new entry, and the executor drops 
> the old entry out of the queue.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
