Thanks Josh and Devaraj! I will try to increase the timeouts. Devaraj, could you share the parameters you used for this test which worked?
On Thu, Jun 15, 2017 at 6:44 AM, Devaraj Das <d...@hortonworks.com> wrote:

> That sounds about right, Josh. Peter, in our internal testing we have seen
> this test failing and increasing timeouts (look at the test code options to
> do with increasing timeout) helped quite some.
> ________________________________________
> From: Josh Elser <josh.el...@gmail.com>
> Sent: Wednesday, June 14, 2017 3:17 PM
> To: dev@hbase.apache.org
> Subject: Re: Problem with IntegrationTestRegionReplicaReplication
>
> On 6/14/17 3:53 AM, Peter Somogyi wrote:
> > Hi,
> >
> > As one of my first task with HBase I started to look into
> > why IntegrationTestRegionReplicaReplication fails. I would like to get
> > some suggestions from you.
> >
> > I noticed when I run the test using normal cluster or minicluster I get
> > the same error messages: "Error checking data for key [null], no data
> > returned". I looked into the code and here are my conclusions.
> >
> > There are multiple threads writing data parallel which are read by
> > multiple reader threads simultaneously. Each writer gets a portion of
> > the keys to write (e.g. 0-2000) and these keys are added to a
> > ConstantDelayQueue. The reader threads get the elements (e.g. key=1000)
> > from the queue and these reader threads assume that all the keys up to
> > this are already in the database. Since we're using multiple writers it
> > can happen that another thread has not yet written key=500 and verifying
> > these keys will cause the test failure.
> >
> > Do you think my assumption is correct?
>
> Hi Peter,
>
> No, as my memory serves, this is not correct. Readers are not made aware
> of keys to verify until the write occur plus some delay. The delay is
> used to provide enough time for the internal region replication to take
> effect.
>
> So: primary-write, pause, [region replication happens in background],
> add updated key to read queue, reader gets key from queue verifies the
> value on a replica.
>
> The primary should always have seen the new value for a key. If the test
> is showing that a replica does not see the result, it's either a timing
> issue (you need to give a larger delay for HBase to perform the region
> replication) or a bug in the region replication framework itself. That
> said, if you can show that you are seeing what you describe, that sounds
> like the test framework itself is broken :)
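
For reference, the sequence Josh describes (primary write, pause, background replication, key handed to the readers, reader verifies on a replica) can be sketched roughly as below. This is only an illustration under assumptions, not the actual test code: DelayedVerifySketch, DelayedKey, the two maps standing in for the primary and a replica, and the replicate() helper are all hypothetical names, and the JDK's java.util.concurrent.DelayQueue is used here in place of the test's own ConstantDelayQueue.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names come from the HBase test itself.
class DelayedVerifySketch {

    /** A written key that readers may only see after a fixed delay. */
    static final class DelayedKey implements Delayed {
        final long key;
        final long readyAtNanos;

        DelayedKey(long key, long delayMillis) {
            this.key = key;
            this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            if (other instanceof DelayedKey) {
                return Long.compare(readyAtNanos, ((DelayedKey) other).readyAtNanos);
            }
            return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    // In-memory stand-ins for the primary region and one replica (assumption).
    static final Map<Long, String> primary = new ConcurrentHashMap<>();
    static final Map<Long, String> replica = new ConcurrentHashMap<>();
    static final DelayQueue<DelayedKey> readQueue = new DelayQueue<>();

    /** Writer side: write to the primary, then queue the key behind a delay. */
    static void write(long key, long verifyDelayMillis) {
        primary.put(key, "v" + key);
        readQueue.put(new DelayedKey(key, verifyDelayMillis));
    }

    /** Stand-in for the background region replication. */
    static void replicate() {
        replica.putAll(primary);
    }

    /** Reader side: wait for the next released key and check it on the replica. */
    static boolean verifyNext() {
        try {
            DelayedKey dk = readQueue.take(); // blocks until the key's delay has elapsed
            return ("v" + dk.key).equals(replica.get(dk.key));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        write(1000L, 100L);
        replicate(); // in this sketch, replication finishes within the delay
        System.out.println("key 1000 verified on replica: " + verifyNext());
    }
}
```

In this sketch a reader never sees a key before its own delay has elapsed, so a failed check means that replication of that particular key had not finished in time (or is broken). It does not depend on other writers having finished lower-numbered keys, which is why a larger delay is the first thing to try.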