Wow, I should spend more time proofreading mails I send before 8am.

Restated: I usually see the OOME around splits. It seems the heap pressure
from splits plus client load is too much. Using LZO compression brings on the
OOME sooner on average, of course. If I'm reading the GC log correctly, this
usually happens in the middle of a normal parallel CMS collection. Either the
GC is just not keeping up or there is a real leak here. Like I said, I need to
dig in with jhat and jprofiler.
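
For reference, this is roughly the kind of setup I mean for capturing
evidence -- a sketch with placeholder paths, set via HBASE_OPTS in
conf/hbase-env.sh, not my exact config:

    # GC logging for CMS, plus a heap dump at the moment of the OOME
    export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
      -Xloggc:/tmp/hbase-regionserver-gc.log \
      -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"

    # post-mortem: dump a live regionserver heap and browse it with jhat
    jmap -dump:live,format=b,file=rs-heap.bin <regionserver pid>
    jhat -J-Xmx4g rs-heap.bin

A run of "concurrent mode failure" entries in the GC log just before the OOME
would point at the collector falling behind (or fragmentation), though a real
leak can end the same way -- hence the heap dump.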

My application simulation performs a transitive web page fetch and writes all
of the content as a List<Put> in one transaction. There are 100 fetch worker
threads, each with a private pool of 8 threads for fetching transitive
resources in parallel. I seed the list with the Alexa top 1M web sites and run
it in EC2, where the instances have fat ingress pipes. So the workload can be
extreme, but it is a reasonable approximation of what we might see in
production. I've tried clusters of 5 or 10 c1.xlarge instances in EC2 and it
doesn't seem to matter. With one client it's OK; add another, and after we get
to the point where there are ~10 regions on each RS, they fall like dominoes.
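
For the curious, the shape of one worker is roughly this -- a sketch only:
the table and column names are made up, and fetch() / extractResourceUrls()
stand in for the real HTTP and parsing code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // One of the 100 fetch workers. Each owns a private 8-thread pool for
    // pulling a page's transitive resources in parallel, then writes the
    // whole fetch as a single List<Put>.
    public class FetchWorker implements Runnable {
      private static final byte[] FAMILY = Bytes.toBytes("c");
      private final ExecutorService resourcePool =
          Executors.newFixedThreadPool(8);
      private final HTable table;
      private final String seedUrl;

      public FetchWorker(Configuration conf, String seedUrl) throws Exception {
        this.table = new HTable(conf, "webcontent");  // table name made up
        this.seedUrl = seedUrl;
      }

      public void run() {
        try {
          List<Put> batch = new ArrayList<Put>();
          batch.add(makePut(seedUrl, fetch(seedUrl)));
          // fan the transitive resources out to the private pool
          List<Future<Put>> pending = new ArrayList<Future<Put>>();
          for (final String res : extractResourceUrls(seedUrl)) {
            pending.add(resourcePool.submit(new Callable<Put>() {
              public Put call() throws Exception {
                return makePut(res, fetch(res));
              }
            }));
          }
          for (Future<Put> f : pending) {
            batch.add(f.get());
          }
          table.put(batch);  // the whole transitive fetch in one batch
        } catch (Exception e) {
          // a real run logs this and moves on to the next seed
        } finally {
          resourcePool.shutdown();
        }
      }

      private Put makePut(String url, byte[] content) {
        Put put = new Put(Bytes.toBytes(url));
        put.add(FAMILY, Bytes.toBytes("body"), content);
        return put;
      }

      // stand-ins for the actual HTTP fetch and link extraction
      private byte[] fetch(String url) throws Exception {
        return new byte[0];
      }

      private List<String> extractResourceUrls(String url) {
        return new ArrayList<String>();
      }
    }

A plain Executors.newFixedThreadPool(100) in the driver runs one of these per
seed URL from the Alexa list.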

Interestingly, 0.20.6 does not do this, though the good results with 0.20.6
are with an earlier version of my simulation. I will retest with the current
one next week.

Best regards,

    - Andy


--- On Thu, 12/2/10, Andrew Purtell <[email protected]> wrote:

> From: Andrew Purtell <[email protected]>
> Subject: Re: [jira] Created: (HBASE-3303) Lower 
> hbase.regionserver.handler.count from 25 back to 10
> To: [email protected]
> Date: Thursday, December 2, 2010, 3:38 PM
> Usually splits for me. Quite similar.
> If I'm reading the GC log correctly, usually in the middle
> of a normal parallel CMS collection. 
> 
> Best regards,
> 
>     - Andy
> 
> 
> --- On Thu, 12/2/10, Todd Lipcon <[email protected]> wrote:
> 
> > From: Todd Lipcon <[email protected]>
> > Subject: Re: [jira] Created: (HBASE-3303) Lower
> > hbase.regionserver.handler.count from 25 back to 10
> > To: [email protected]
> > Date: Thursday, December 2, 2010, 3:28 PM
> > On Thu, Dec 2, 2010 at 3:21 PM, Jean-Daniel Cryans
> > <[email protected]> wrote:
> > 
> > > Hey Andrew,
> > >
> > > They were still all dead? From session expiration or OOME? Or HDFS
> > > issues?
> > >
> > >
> > I've found the same in my load testing - it's a compaction pause for me.
> > Avoiding heap fragmentation seems to be basically impossible.
> > 
> > -Todd
> > 
> > 
> > > J-D
> > >
> > > On Thu, Dec 2, 2010 at 3:17 PM, Andrew Purtell <[email protected]>
> > > wrote:
> > > > J-D,
> > > >
> > > > Your hypothesis is interesting.
> > > >
> > > > I took the same step -- change 100 -> 10 -- to reduce the probability
> > > > that regionservers would OOME under high write load as generated by an
> > > > end simulation I have been developing, to model an application we plan
> > > > to deploy. (Stack, this is the next generation of the monster that led
> > > > us to find the problem with ByteArrayOutputStream buffer management in
> > > > the 0.19 time frame. It's baaaaack, bigger than before.)
> > > >
> > > > Reducing handler.count did move the needle, but sooner or later they
> > > > are all dead, at 4G heap or 8G heap... and the usual GC tuning tricks
> > > > are not helping.
> > > >
> > > > When I get back from this latest tour of Asia next week I need to dig
> > > > in with jhat and jprofiler.
> > > >
> > > > Best regards,
> > > >
> > > >    - Andy
> > > >
> > > >
> > > > --- On Thu, 12/2/10, Jean-Daniel Cryans (JIRA) <[email protected]>
> > > > wrote:
> > > >
> > > >> From: Jean-Daniel Cryans (JIRA) <[email protected]>
> > > >> Subject: [jira] Created: (HBASE-3303) Lower
> > > >> hbase.regionserver.handler.count from 25 back to 10
> > > >> To: [email protected]
> > > >> Date: Thursday, December 2, 2010, 2:02 PM
> > > >> Lower hbase.regionserver.handler.count from 25 back to 10
> > > >> ----------------------------------------------------------
> > > >>
> > > >>        Key: HBASE-3303
> > > >>        URL: https://issues.apache.org/jira/browse/HBASE-3303
> > > >>    Project: HBase
> > > >> Issue Type: Improvement
> > > >>   Reporter: Jean-Daniel Cryans
> > > >>   Assignee: Jean-Daniel Cryans
> > > >>    Fix For: 0.90.0
> > > >>
> > > >>
> > > >> With HBASE-2506 in mind, I tested a low-memory environment (2GB of
> > > >> heap) with a lot of concurrent writers using the default write buffer
> > > >> to verify if a lower number of handlers actually helps reduce the
> > > >> occurrence of full GCs. Very unscientifically, at this moment I think
> > > >> it's safe to say that yes, it helps.
> > > >>
> > > >> With the defaults, I saw a region server struggling more and more
> > > >> because the random inserters at some point started filling up all the
> > > >> handlers and were all BLOCKED trying to sync the WAL. It's safe to
> > > >> say that each of those clients carried a payload that the GC cannot
> > > >> get rid of, and it's one that we don't account for (as opposed to
> > > >> MemStore and the block cache).
> > > >>
> > > >> With a much lower setting of 5, I didn't see this situation.
> > > >>
> > > >> It kind of confirms my hypothesis, but I need to do more proper
> > > >> testing. In the meantime, in order to lower the onslaught of users
> > > >> who write to the ML complaining about either GCs or OOMEs, I think we
> > > >> should set the handler count back to what it was originally (10) for
> > > >> 0.90.0 and add some documentation about configuring
> > > >> hbase.regionserver.handler.count.
> > > >>
> > > >> I'd like to hear others' thoughts.
> > > >>
> > > >> --
> > > >> This message is automatically generated by JIRA.
> > > >> -
> > > >> You can reply to this email to add a comment to the issue online.
> > > >>
> > > >>
> > > >
> > > >
> > > >
> > > >
> > >
> > 
> > 
> > 
> > -- 
> > Todd Lipcon
> > Software Engineer, Cloudera
> > 
> 
> 
> 
> 


