The heap seems a _bit_ bigger than perhaps it should be, but that just gives the CMS more work to do. The hidden problem might be heap fragmentation: since Java uses lots of small objects, we might end up with an ever-growing heap.
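One way to check whether fragmentation is actually the culprit (a sketch; I'm going from the HotSpot diagnostic flag here, not from output I have in hand) is to dump the CMS free-list statistics around each collection and watch whether the free chunks get smaller and more numerous over time:

  -XX:PrintFLSStatistics=1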
Here is a GC log during the CMS concurrent sweep (when it is reclaiming space from the main heap):
42374.524: [CMS-concurrent-sweep-start]
42374.576: [GC 42374.576: [ParNew: 5560K->576K(5568K), 0.0115800 secs] 2709982K->2706003K(2947416K), 0.0117020 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
42374.616: [GC 42374.616: [ParNew: 5568K->576K(5568K), 0.0147850 secs] 2703025K->2701054K(2947416K), 0.0148980 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
42374.685: [GC 42374.685: [ParNew: 5568K->574K(5568K), 0.0104920 secs] 2690723K->2687015K(2947416K), 0.0106060 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
42374.751: [GC 42374.752: [ParNew: 5566K->574K(5568K), 0.0118220 secs] 2672236K->2668439K(2947416K), 0.0119340 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
42375.483: [GC 42375.483: [ParNew: 5508K->575K(5568K), 0.0090170 secs] 2329262K->2325313K(2947416K), 0.0091370 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
42375.614: [GC 42375.615: [ParNew: 5504K->467K(5568K), 0.0071550 secs] 2275783K->2271206K(2947416K), 0.0072510 secs] [Times: user=0.06 sys=0.00, real=0.00 secs]
42375.746: [GC 42375.746: [ParNew: 5451K->428K(5568K), 0.0064950 secs] 2210537K->2205514K(2947416K), 0.0066040 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
42375.885: [GC 42375.885: [ParNew: 5420K->362K(5568K), 0.0080160 secs] 2149436K->2144384K(2947416K), 0.0081120 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
42376.006: [GC 42376.006: [ParNew: 5354K->548K(5568K), 0.0064720 secs] 2097343K->2092544K(2947416K), 0.0065760 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
42376.120: [GC 42376.120: [ParNew: 5540K->576K(5568K), 0.0089420 secs] 2044984K->2040242K(2947416K), 0.0090420 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
42376.336: [GC 42376.336: [ParNew: 5568K->576K(5568K), 0.0092920 secs] 1948100K->1943594K(2947416K), 0.0093910 secs] [Times: user=0.06 sys=0.01, real=0.01 secs]
42376.885: [GC 42376.885: [ParNew: 5540K->576K(5568K), 0.0119010 secs] 1677094K->1673092K(2947416K), 0.0120230 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
42377.328: [CMS-concurrent-sweep: 2.679/2.804 secs] [Times: user=4.75 sys=0.10, real=2.80 secs]
Basically we go from about 2700 MB to 1670 MB, reclaiming roughly 1030 MB. The sweep took about 2.8 s, and there were 12 minor collections during that time, with the occupied heap shrinking between each one as the sweep reclaimed space concurrently. The ParNew pauses are higher than I want: around 10 ms each, up to a high of about 15 ms.
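For reference, logging like the above comes from flags along these lines; the exact set on this cluster isn't quoted anywhere in the thread, so treat it as a sketch:

  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log

In each entry, "ParNew: 5560K->576K(5568K)" is the young generation before->after(capacity), and "2709982K->2706003K(2947416K)" is the whole heap before->after(capacity).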
My next thought/experiment is to move to a slightly larger ParNew, maybe 12 MB, to keep more garbage from being promoted into the main heap.
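For HBase that change would presumably land in conf/hbase-env.sh via HBASE_OPTS; a sketch, where the -Xmx value is an assumption rather than something quoted in this thread:

  # conf/hbase-env.sh (sketch; -Xmx assumed, not our confirmed setting)
  export HBASE_OPTS="-Xmx3000m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
      -XX:NewSize=12m -XX:MaxNewSize=12m"

Setting NewSize and MaxNewSize equal pins ParNew so it can't grow back toward 150 MB on its own.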
However, HBase won't take a rolling restart - as soon as you touch the
.META. regionserver, you're screwed.
On Sat, Apr 25, 2009 at 1:05 PM, Chad Walters <[email protected]> wrote:
Cool -- that seems like it could be a reasonable set of tradeoffs for latency-sensitive uses.
How big is the memory increase?
Do you know the chief source of the allocations that are GC'd?
Chad
On 4/25/09 12:47 PM, "Ryan Rawson" <[email protected]> wrote:
The incremental GC doesn't help on a multicore system; it's designed to break up the giant concurrent phase (which I've seen last 5 minutes) on systems with 1 or 2 CPUs, to prevent the concurrent GC from starving the app of CPU. My machines have 8 cores, so this won't really help.
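For reference, that incremental mode is the one enabled with flags like the following (optionally paced); if I'm reading it right, the icms_dc=0 field in the 0.20 log quoted below is its duty-cycle readout:

  -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing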
I talked to a friend, and I realized the problem is that ParNew is growing too big... It starts out at about 15 MB and grows 10x to 150 MB. My friend suggested that I size ParNew to the L2 cache size, which is 6 MB on my Intel Xeon CPUs (I'm not sure whether the L2 is shared between pairs of cores or not).
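In flag form, that experiment is just pinning the young generation at the L2 size; a sketch (only the 6 MB figure comes from the suggestion above, the rest is assumed):

  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:NewSize=6m -XX:MaxNewSize=6m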
Switching to a much smaller ParNew has several effects:
- GC pauses are tiny, around 1 ms
- GC of ParNew happens 1-10 times a second
- Overall memory use goes up, as more stuff enters the general heap and has to wait for the CMS to reclaim it. I watched the CMS prune 1 GB of garbage in 3 seconds.
However, the crippling pauses that were keeping my cluster from performing are gone, at the cost of higher RAM usage.
On Sat, Apr 25, 2009 at 10:09 AM, Chad Walters <[email protected]> wrote:
Done any experimentation with the incremental GC?
Chad
On 4/23/09 6:28 PM, "Ryan Rawson" <[email protected]> wrote:
From a log of HBase 0.20 in action:
11061.089: [GC 11061.089: [ParNew: 153331K->17024K(153344K), 0.2249580 secs] 3396126K->3309484K(4579960K) icms_dc=0 , 0.2251020 secs] [Times: user=0.66 sys=0.04, real=0.23 secs]
11062.439: [GC 11062.439: [ParNew: 153344K->17024K(153344K), 0.1625480 secs] 3445804K->3335150K(4579960K) icms_dc=0 , 0.1627020 secs] [Times: user=0.37 sys=0.02, real=0.17 secs]
Yes, in 1.4 seconds we had a 225 ms pause and a 162 ms pause. This is not atypical; most pauses are in the 80-150 ms range.
-ryan