That sounds about right, Josh. Peter, in our internal testing we have seen this test fail, and increasing the timeouts (look at the test code options for increasing the timeout) helped quite a bit.

________________________________________
From: Josh Elser <[email protected]>
Sent: Wednesday, June 14, 2017 3:17 PM
To: [email protected]
Subject: Re: Problem with IntegrationTestRegionReplicaReplication

On 6/14/17 3:53 AM, Peter Somogyi wrote:
> Hi,
>
> As one of my first tasks with HBase I started to look into
> why IntegrationTestRegionReplicaReplication fails. I would like to get some
> suggestions from you.
>
> I noticed when I run the test using normal cluster or minicluster I get the
> same error messages: "Error checking data for key [null], no data
> returned". I looked into the code and here are my conclusions.
>
> There are multiple threads writing data in parallel which are read by multiple
> reader threads simultaneously. Each writer gets a portion of the keys to
> write (e.g. 0-2000) and these keys are added to a ConstantDelayQueue.
> The reader threads get the elements (e.g. key=1000) from the queue and
> these reader threads assume that all the keys up to this are already in the
> database. Since we're using multiple writers it can happen that another
> thread has not yet written key=500 and verifying these keys will cause the
> test failure.
>
> Do you think my assumption is correct?

Hi Peter,

No, as my memory serves, this is not correct. Readers are not made aware of keys to verify until the write occurs plus some delay. The delay is used to provide enough time for the internal region replication to take effect.

So: primary-write, pause, [region replication happens in background], add updated key to read queue, reader gets key from queue and verifies the value on a replica.

The primary should always have seen the new value for a key. If the test is showing that a replica does not see the result, it's either a timing issue (you need to give a larger delay for HBase to perform the region replication) or a bug in the region replication framework itself.

That said, if you can show that you are seeing what you describe, that sounds like the test framework itself is broken :)
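
For illustration, the sequence described above (write, pause, add key to read queue, reader verifies) could be sketched roughly as below. This is only a minimal sketch under stated assumptions: it uses the JDK's java.util.concurrent.DelayQueue and an in-memory map as stand-ins, it is not HBase's ConstantDelayQueue or the integration test's actual code, and the names DelayedVerifySketch, DelayedKey, and writeThenVerify are hypothetical. Region replication is not modeled; the delay only shows that a reader cannot see a key before the write plus the pause.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not HBase code: keys reach readers only after a fixed delay.
final class DelayedVerifySketch {

    /** A written key that becomes available to readers once its delay has elapsed. */
    static final class DelayedKey implements Delayed {
        final long key;
        final long readyAtNanos;

        DelayedKey(long key, long delayMillis) {
            this.key = key;
            this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            if (other instanceof DelayedKey) {
                return Long.compare(readyAtNanos - ((DelayedKey) other).readyAtNanos, 0L);
            }
            return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    /**
     * Writes numKeys keys to an in-memory stand-in for the primary, enqueues each key
     * only after its write, then verifies every key once its delay has passed.
     * Returns the number of keys verified.
     */
    static int writeThenVerify(int numKeys, long delayMillis) {
        Map<Long, String> store = new ConcurrentHashMap<>();
        DelayQueue<DelayedKey> readQueue = new DelayQueue<>();

        for (long k = 0; k < numKeys; k++) {
            store.put(k, "value-" + k);                     // write first
            readQueue.put(new DelayedKey(k, delayMillis));  // then expose the key, delayed
        }

        int verified = 0;
        try {
            for (int i = 0; i < numKeys; i++) {
                DelayedKey next = readQueue.take();         // blocks until the delay elapses
                if (store.get(next.key) == null) {
                    throw new IllegalStateException("no data returned for key " + next.key);
                }
                verified++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for keys", e);
        }
        return verified;
    }

    public static void main(String[] args) {
        System.out.println("verified keys: " + writeThenVerify(5, 50));
    }
}
```

In this sketch a key is never handed to a reader before its own write has completed, so a missing value would point at either too short a delay or a fault in the (here unmodeled) replication step.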
