Following up on this. Backporting HBASE-4485 didn't seem to help. We were a bit under pressure and I didn't have time to investigate further (there's a small chance I missed something during the backport).
We eventually upgraded to 0.92 which fixed the problem :)

Thanks a lot for helping with this,
Cosmin

On 2/15/12 1:33 PM, "Cosmin Lehene" <[email protected]> wrote:

>Amit, HBASE-4485 describes the behavior I'm seeing, thanks.
>
>Looking over the patches I'm under the impression that HBASE-4485 which
>is a subtask of HBASE-2856 was back ported through HBASE-4838 to 0.92 by
>Lars.
>Am I wrong?
>
>Thanks,
>Cosmin
>
>
>On 2/14/12 11:06 PM, "Amitanand Aiyer" <[email protected]> wrote:
>
>>Hi Cosmin,
>> https://issues.apache.org/jira/browse/HBASE-4485 might be applicable.
>>
>> The patch was included in the fix for 2856.
>>
>>Cheers,
>>-Amit
>>
>>________________________________________
>>From: Cosmin Lehene [[email protected]]
>>Sent: Tuesday, February 14, 2012 12:02 PM
>>To: [email protected]
>>Subject: Re: MR job "randomly" scans up thousands of rows less than the
>>it should.
>>
>>I just got back on this issue. Initially the behavior we've seen (missing
>>rows) wouldn't reproduce on 0.90 using TestAcidGuarantees.
>>However, if the puts in the writer threads include additional rows the
>>scanners will start reading less rows. This reproduces consistently on
>>0.90 and seems to be working correctly on 0.92.
>>
>>HBASE-2856/HBASE-4838 are probably the solution, although there's a
>>chance it's some other fix on 0.92 (ideas?)
>>
>>We're undecided whether backporting to 0.90 vs upgrading the affected
>>clusters to 0.92 would be better?
>>Also is there interest for this fix on 0.90?
>>
>>Thanks,
>>Cosmin
>>
>>On 2/6/12 6:25 PM, "Cosmin Lehene" <[email protected]> wrote:
>>
>>>Thanks Ted!
>>>
>>>I wonder if it would make more sense to port it to 0.90.X or upgrade to
>>>0.92.
>>>
>>>Cosmin
>>>
>>>On 2/2/12 5:03 PM, "Ted Yu" <[email protected]> wrote:
>>>
>>>>HBASE-4838 ports HBASE-2856 to 0.92
>>>>
>>>>FYI
>>>>
>>>>On Thu, Feb 2, 2012 at 4:46 PM, Cosmin Lehene <[email protected]>
>>>>wrote:
>>>>
>>>>> (sorry for the damaged subject :))
>>>>>
>>>>>
>>>>> Hey Jon,
>>>>> We have two column families.
>>>>> There are no filters and there's a full table scan. We're not
>>>>> skipping rows.
>>>>> I did see however a single time that we had one qualifier "fault" in
>>>>> the job counters (it was missing, and it wasn't supposed to be
>>>>> missing).
>>>>> However that was only once and it doesn't happen when we encounter
>>>>> missing rows.
>>>>>
>>>>> We're getting this behavior consistently although I couldn't figure
>>>>> out a way to reproduce it. I'll try running multiple instances of the
>>>>> job in parallel to figure out if that would affect the outcome.
>>>>> I'll probably have to add more debugging for the affected rows and
>>>>> dig deeper.
>>>>>
>>>>> HBASE-2856 is a pretty large issue - do you think it could be related
>>>>> to what I'm seeing? If so it could help me reproduce it.
>>>>>
>>>>> Thanks,
>>>>> Cosmin
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 2/1/12 11:30 PM, "Jonathan Hsieh" <[email protected]> wrote:
>>>>>
>>>>> >Cosmin,
>>>>> >
>>>>> >How many column families do you have in this table? Are you using
>>>>> >any filters in your HBase scans? Are you skipping rows that may not
>>>>> >have qualifiers present?
>>>>> >
>>>>> >There are a few known issues with multi-CF atomicity and a recent
>>>>> >one about flushes that may be related to this problem. There's
>>>>> >HBASE-2856, a fix having to do with flushes which is pretty
>>>>> >intricate and only in 0.92.
>>>>> >
>>>>> >Jon.
>>>>> >
>>>>> >On Wed, Feb 1, 2012 at 8:46 PM, Cosmin Lehene <[email protected]>
>>>>> >wrote:
>>>>> >
>>>>> >> We have a MR job that runs every few minutes on some time series
>>>>> >> data which is continuously updated (never deleted).
>>>>> >> Every few (in the range of tens to hundreds) runs the map task
>>>>> >> that covers the last region will get fewer input records (off by
>>>>> >> 500-5000 rows) without any splits happening. This lower number of
>>>>> >> input records could persist for a few MR runs, but will eventually
>>>>> >> get back to the "correct" value.
>>>>> >>
>>>>> >> This drop can be seen both in the "map input records" metric but
>>>>> >> it's correlated with the metrics that get computed by the MR job
>>>>> >> (so it's not a MR counter bug).
>>>>> >>
>>>>> >> There are no exceptions in the MR job, or in the region server and
>>>>> >> this doesn't seem to be correlated with any compaction, split or
>>>>> >> region movement.
>>>>> >> The only "variable" in this scenario is that new data gets
>>>>> >> injected continuously (and the actual MR job which is idempotent)
>>>>> >>
>>>>> >> This entire puzzle takes place on HBase 0.90.5 ish (12 dec 2011)
>>>>> >> on top of Hadoop cdh3u2.
>>>>> >>
>>>>> >> Cosmin
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >
>>>>> >
>>>>> >--
>>>>> >// Jonathan Hsieh (shay)
>>>>> >// Software Engineer, Cloudera
>>>>> >// [email protected]
>>>>>
>>
>
