My concern with the methodology is that we can get into a dribble mode. Consider the following scenario:

1) We get a usage threshold exceeded notification.
2) We spill, but not enough to activate the garbage collector.
3) Next time the JVM checks, will we still get a usage threshold exceeded notification? I assume so, since the gc won't have run. But at this point it's highly unlikely that we'll spill enough to activate the gc. From here on out we're stuck, spilling little bits but not calling the gc until the system invokes it.

We could mitigate this somewhat by tracking spill sizes across spills and invoking the gc once the cumulative total reaches the threshold. This does not avoid the dribble, but it does shorten it.
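
A minimal sketch of that mitigation, assuming we keep a running total of bytes freed across spills. SpillAccumulator and recordSpill are hypothetical names, not existing Pig code; gcActivationSize is the threshold discussed in the quoted message.

```java
// Hypothetical helper, not part of Pig: accumulates bytes freed across
// spill rounds and reports when a gc request is due.
final class SpillAccumulator {
    private final long gcActivationSize; // bytes to free before requesting a gc
    private long freedSinceLastGc = 0;   // running total since the last request

    SpillAccumulator(long gcActivationSize) {
        this.gcActivationSize = gcActivationSize;
    }

    /** Records one spill round; returns true once the running total reaches the threshold. */
    boolean recordSpill(long bytesFreed) {
        freedSinceLastGc += bytesFreed;
        if (freedSinceLastGc >= gcActivationSize) {
            freedSinceLastGc = 0; // start a new accumulation window
            return true;
        }
        return false;
    }
}
```

The caller would do something like `if (acc.recordSpill(freed)) System.gc();` after each spill round, so several small spills eventually add up to one gc request.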

I think any time we spill we should invoke the gc to avoid the dribble. Pradeep is concerned that this will cause us to invoke the gc too often, which is a possible cause of the error we see. Is it possible to estimate our spill size before we start spilling and decide up front whether to try it or not?
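
One possible shape for such an estimate, assuming every spillable can report its in-memory size before it is spilled. The Sized interface and both method names are hypothetical, and the decision rule (only spill if the estimate reaches gcActivationSize) is just one option.

```java
import java.util.List;

// Hypothetical sketch, not existing Pig code.
final class SpillEstimator {
    /** Stand-in for a spillable that can report its in-memory size in bytes. */
    interface Sized {
        long getMemorySize();
    }

    /** Sums reported sizes in iteration order, stopping once the target is covered. */
    static long estimateFreeable(List<? extends Sized> spillables, long toFree) {
        long estimate = 0;
        for (Sized s : spillables) {
            if (estimate >= toFree) {
                break; // we would stop spilling here
            }
            estimate += s.getMemorySize();
        }
        return estimate;
    }

    /** Decides up front whether the spill is likely to be large enough to justify a gc. */
    static boolean worthSpilling(List<? extends Sized> spillables,
                                 long toFree, long gcActivationSize) {
        return estimateFreeable(spillables, toFree) >= gcActivationSize;
    }
}
```

The estimate is only as good as the sizes the spillables report, so this would be a heuristic rather than a guarantee.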
Alan.

Pradeep Kamath wrote:
Hi,

Currently in org.apache.pig.impl.util.SpillableMemoryManager:

1) We use the MemoryManagement interface to get notified when memory
usage exceeds the "collection threshold" (we set this to
biggest_heap*0.5). With this in place we are still seeing "GC overhead
limit" issues when trying large dataset operations. Observing some runs,
it looks like the notification is neither frequent enough nor early
enough to prevent memory issues, possibly because this notification only
occurs after GC.

2) We only attempt to free up to:

long toFree = info.getUsage().getUsed() -
(long)(info.getUsage().getMax()*.5);

This is only the excess amount over the threshold which caused the
notification, and is not enough to keep us from being notified again
soon.

3) While iterating over spillables, if the current spillable's memory
size is > gcActivationSize, we try to invoke System.gc()

4) We *always* invoke System.gc() after iterating over spillables
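
For reference, the collection threshold in item 1) can be set through the standard java.lang.management API roughly as below. This is a sketch, not the actual SpillableMemoryManager code; the class and method names are made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Illustrative only; ThresholdSetup is a hypothetical name.
final class ThresholdSetup {
    /** Returns the heap pool with the largest defined maximum, or null if none has one. */
    static MemoryPoolMXBean biggestHeapPool() {
        MemoryPoolMXBean biggest = null;
        long biggestMax = -1;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue;
            }
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage == null) {
                continue;
            }
            long max = usage.getMax(); // -1 when the maximum is undefined
            if (max > biggestMax) {
                biggestMax = max;
                biggest = pool;
            }
        }
        return biggest;
    }

    /** Sets the collection threshold to fraction * max; returns the value set, or -1 if unsupported. */
    static long setCollectionThreshold(MemoryPoolMXBean pool, double fraction) {
        if (pool == null || !pool.isCollectionUsageThresholdSupported()) {
            return -1;
        }
        long threshold = (long) (pool.getUsage().getMax() * fraction);
        pool.setCollectionUsageThreshold(threshold);
        return threshold;
    }
}
```

With a threshold set this way, notifications are delivered to a NotificationListener registered on ManagementFactory.getMemoryMXBean(), and the JVM only evaluates the collection threshold against usage measured after a garbage collection.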

Proposed changes are:

=================

1) In addition to "collection threshold" of biggest_heap*0.5, a "usage
threshold" of biggest_heap*0.7 will be used so we get notified early and
often irrespective of whether garbage collection has occurred.

2) We will attempt to free
toFree = info.getUsage().getUsed() - threshold + (long)(threshold *
0.5); where threshold is (info.getUsage().getMax() * 0.7) if the
handleNotification() method is handling a "usage threshold exceeded"
notification and (info.getUsage().getMax() * 0.5) otherwise ("collection
threshold exceeded" case)

3) While iterating over spillables, if the *memory freed thus far* is >
gcActivationSize OR if we have freed sufficient memory (based on 2)
above), then we set a flag to invoke System.gc() when we exit the loop.

4) We will invoke System.gc() only if the flag is set in 3) above
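
Proposed items 2) to 4) could look roughly like the sketch below. It is not a patch: the Spillable interface here is a simplified stand-in whose spill() is assumed to return the bytes it freed, and the class and method names are hypothetical.

```java
import java.util.List;

// Hypothetical sketch of the proposed policy, not the actual Pig code.
final class ProposedSpillPolicy {
    /** Simplified stand-in: spill() is assumed to return the number of bytes freed. */
    interface Spillable {
        long spill();
    }

    /** Item 2): free the excess over the threshold plus half the threshold. */
    static long toFree(long used, long max, boolean usageThresholdExceeded) {
        long threshold = (long) (max * (usageThresholdExceeded ? 0.7 : 0.5));
        return used - threshold + (long) (threshold * 0.5);
    }

    /** Items 3) and 4): spill until toFree is reached; returns whether System.gc() should run. */
    static boolean spillAndDecideGc(List<? extends Spillable> spillables,
                                    long toFree, long gcActivationSize) {
        long freed = 0;
        boolean invokeGc = false;
        for (Spillable s : spillables) {
            freed += s.spill();
            if (freed > gcActivationSize) {
                invokeGc = true; // enough freed so far to justify a gc
            }
            if (freed >= toFree) {
                invokeGc = true; // target reached, stop spilling
                break;
            }
        }
        return invokeGc;
    }
}
```

The caller would invoke System.gc() only when spillAndDecideGc returns true. Note that it returns false when the spillables run out before either condition is met, so in that case no gc is requested.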

Please provide thoughts/comments.

Thanks,

Pradeep


