Re: MR job "randomly" scans up thousands of rows less than the it should.

Cosmin Lehene Wed, 15 Feb 2012 03:34:02 -0800

Amit, HBASE-4485 describes the behavior I'm seeing, thanks.

Looking over the patches, I'm under the impression that HBASE-4485, which
is a subtask of HBASE-2856, was backported to 0.92 through HBASE-4838 by
Lars.
Am I wrong?


Thanks,
Cosmin


On 2/14/12 11:06 PM, "Amitanand Aiyer" <[email protected]> wrote:

>Hi Cosmin,
>  https://issues.apache.org/jira/browse/HBASE-4485 might be applicable.
>
>  The patch was included in the fix for 2856.
>
>Cheers,
>-Amit
>
>________________________________________
>From: Cosmin Lehene [[email protected]]
>Sent: Tuesday, February 14, 2012 12:02 PM
>To: [email protected]
>Subject: Re: MR job "randomly" scans up thousands of rows less than the
>it should.
>
>I just got back on this issue. Initially the behavior we've seen (missing
>rows) wouldn't reproduce on 0.90 using TestAcidGuarantees.
>However, if the puts in the writer threads include additional rows, the
>scanners will start reading fewer rows. This reproduces consistently on
>0.90 and seems to be working correctly on 0.92.
>
>HBASE-2856/HBASE-4838 are probably the solution, although there's a chance
>it's some other fix on 0.92 (ideas?)
>
>We're undecided whether backporting to 0.90 or upgrading the affected
>clusters to 0.92 would be better.
>Also, is there interest in this fix on 0.90?
>
>Thanks,
>Cosmin
>
>On 2/6/12 6:25 PM, "Cosmin Lehene" <[email protected]> wrote:
>
>>Thanks Ted!
>>
>>I wonder if it would make more sense to port it to 0.90.X or upgrade to
>>0.92.
>>
>>Cosmin
>>
>>On 2/2/12 5:03 PM, "Ted Yu" <[email protected]> wrote:
>>
>>>HBASE-4838 ports HBASE-2856 to 0.92
>>>
>>>FYI
>>>
>>>On Thu, Feb 2, 2012 at 4:46 PM, Cosmin Lehene <[email protected]> wrote:
>>>
>>>> (sorry for the damaged subject :))
>>>>
>>>>
>>>> Hey Jon,
>>>> We have two column families.
>>>> There are no filters and there's a full table scan. We're not skipping
>>>> rows.
>>>> I did see, however, a single time that we had one qualifier "fault" in
>>>> the job counters (it was missing, and it wasn't supposed to be missing).
>>>> However, that was only once and it doesn't happen when we encounter
>>>> missing rows.
>>>>
>>>> We're getting this behavior consistently, although I couldn't figure out
>>>> a way to reproduce it. I'll try running multiple instances of the job in
>>>> parallel to figure out if that would affect the outcome.
>>>> I'll probably have to add more debugging for the affected rows and dig
>>>> deeper.
>>>>
>>>> HBASE-2856 is a pretty large issue - do you think it could be related to
>>>> what I'm seeing? If so it could help me reproduce it.
>>>>
>>>> Thanks,
>>>> Cosmin
>>>>
>>>>
>>>>
>>>>
>>>> On 2/1/12 11:30 PM, "Jonathan Hsieh" <[email protected]> wrote:
>>>>
>>>> >Cosmin,
>>>> >
>>>> >How many column families do you have in this table?  Are you using any
>>>> >filters in your HBase scans?  Are you skipping rows that may not have
>>>> >qualifiers present?
>>>> >
>>>> >There are a few known issues with multi-CF atomicity and a recent one
>>>> >about flushes that may be related to this problem.  There is HBASE-2856,
>>>> >a fix having to do with flushes which is pretty intricate and only in
>>>> >0.92.
>>>> >
>>>> >Jon.
>>>> >
>>>> >On Wed, Feb 1, 2012 at 8:46 PM, Cosmin Lehene <[email protected]> wrote:
>>>> >
>>>> >> We have an MR job that runs every few minutes on some time series
>>>> >> data which is continuously updated (never deleted).
>>>> >> Every few runs (in the range of tens to hundreds), the map task that
>>>> >> covers the last region will get fewer input records (off by 500-5000
>>>> >> rows) without any splits happening. This lower number of input
>>>> >> records could persist for a few MR runs, but will eventually get back
>>>> >> to the "correct" value.
>>>> >>
>>>> >> This drop can be seen in the "map input records" metric, and it's
>>>> >> also correlated with the metrics that get computed by the MR job (so
>>>> >> it's not an MR counter bug).
>>>> >>
>>>> >> There are no exceptions in the MR job or in the region server, and
>>>> >> this doesn't seem to be correlated with any compaction, split or
>>>> >> region movement.
>>>> >> The only "variable" in this scenario is that new data gets injected
>>>> >> continuously (and the actual MR job, which is idempotent).
>>>> >>
>>>> >> This entire puzzle takes place on HBase 0.90.5 ish (12 dec 2011) on
>>>> >> top of Hadoop cdh3u2.
>>>> >>
>>>> >> Cosmin
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >
>>>> >
>>>> >--
>>>> >// Jonathan Hsieh (shay)
>>>> >// Software Engineer, Cloudera
>>>> >// [email protected]
>>>>
>
