The GC overhead limit error can occur even when we are not low on
memory, for example when memory is fragmented and the GC spends too
much time freeing very little. At the same time, we don't want to hurt
performance by invoking GC too often. Keeping both in mind, I propose
that GCActivationSize be compared against the memory freed thus far
rather than against the current Spillable's memory size, that we set a
flag once this size is reached, and that we invoke GC only once per
handler invocation.
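
A minimal sketch of the loop described above, not the actual Pig code. It assumes a hypothetical SketchSpillable interface whose spill() returns the number of bytes it freed, a toFree target that ends the loop early, and a Runnable hook standing in for System.gc() so the number of GC requests can be observed:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a spillable; spill() returns the bytes it freed. */
interface SketchSpillable {
    long spill();
}

final class SpillLoopSketch {
    // Default suggested in this message (40 MB).
    static final long GC_ACTIVATION_SIZE = 40_000_000L;

    /**
     * Spills until toFree bytes have been released or the list is exhausted,
     * then runs the GC hook at most once. Returns the bytes freed.
     */
    static long spillUntilFreed(List<SketchSpillable> spillables, long toFree,
                                long gcActivationSize, Runnable gcHook) {
        long freed = 0L;
        boolean invokeGc = false;
        for (SketchSpillable s : spillables) {
            freed += s.spill();
            // Compare the running total, not the size of the current spillable.
            if (freed > gcActivationSize) {
                invokeGc = true;
            }
            // Assumption: the loop stops once enough memory has been freed.
            if (freed >= toFree) {
                invokeGc = true;
                break;
            }
        }
        if (invokeGc) {
            gcHook.run(); // a single GC request per handler invocation
        }
        return freed;
    }

    public static void main(String[] args) {
        List<SketchSpillable> bags = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            bags.add(() -> 15_000_000L); // each pretend bag frees 15 MB
        }
        long freed = spillUntilFreed(bags, 60_000_000L, GC_ACTIVATION_SIZE, System::gc);
        System.out.println("freed=" + freed);
    }
}
```

Passing the GC call in as a hook is only a convenience for this sketch; it keeps the "at most once" behaviour easy to check without relying on what System.gc() actually does.
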
Also, I would like to use the following defaults if they seem
reasonable:

    // if we freed at least this much, invoke GC
    // (default 40 MB - this can be overridden by a user-supplied property)
    private static long gcActivationSize = 40000000L;

    // spill file size should be at least this much
    // (default 5 MB - this can be overridden by a user-supplied property)
    private static long spillFileSizeThreshold = 5000000L;

    // fraction of biggest heap for which we want to get
    // "memory usage threshold exceeded" notifications
    private static double memoryThresholdFraction = 0.7;

    // fraction of biggest heap for which we want to get
    // "collection threshold exceeded" notifications
    private static double collectionMemoryThresholdFraction = 0.5;

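
To illustrate how the two fractions could be applied, here is a hedged sketch using the standard java.lang.management API. The class name ThresholdSketch and its helper methods are hypothetical, and the choice of "biggest heap" pool (the largest heap pool that supports both threshold types) is an assumption rather than a description of the existing code. bytesToFree follows the toFree formula from item 2 of the quoted proposal below:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

final class ThresholdSketch {
    static final double MEMORY_THRESHOLD_FRACTION = 0.7;
    static final double COLLECTION_MEMORY_THRESHOLD_FRACTION = 0.5;

    /** Threshold in bytes for a pool with the given maximum size. */
    static long thresholdBytes(long max, double fraction) {
        return (long) (max * fraction);
    }

    /** Excess over the threshold plus half the threshold, as in the proposal. */
    static long bytesToFree(long used, long max, double fraction) {
        long threshold = thresholdBytes(max, fraction);
        return used - threshold + (long) (threshold * 0.5);
    }

    /** Largest heap pool supporting both threshold types, or null if none does. */
    static MemoryPoolMXBean biggestHeapPool() {
        MemoryPoolMXBean biggest = null;
        long biggestMax = 0L;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null
                    || pool.getType() != MemoryType.HEAP
                    || !pool.isUsageThresholdSupported()
                    || !pool.isCollectionUsageThresholdSupported()) {
                continue;
            }
            // getMax() is -1 when undefined, so such pools are never selected.
            if (usage.getMax() > biggestMax) {
                biggestMax = usage.getMax();
                biggest = pool;
            }
        }
        return biggest;
    }

    public static void main(String[] args) {
        MemoryPoolMXBean pool = biggestHeapPool();
        if (pool == null) {
            System.out.println("no heap pool supports both thresholds");
            return;
        }
        long max = pool.getUsage().getMax();
        pool.setUsageThreshold(thresholdBytes(max, MEMORY_THRESHOLD_FRACTION));
        pool.setCollectionUsageThreshold(
                thresholdBytes(max, COLLECTION_MEMORY_THRESHOLD_FRACTION));
        System.out.println(pool.getName()
                + ": usage threshold=" + pool.getUsageThreshold()
                + ", collection threshold=" + pool.getCollectionUsageThreshold());
    }
}
```

As a worked example of the formula: with max = 1000, used = 750 and a fraction of 0.5, the threshold is 500 and bytesToFree returns 500, which would bring usage down to half the threshold instead of just back to it.
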
I am currently running more tests to check if previously seen issues
with queries are now solved with these changes.
-Pradeep
-----Original Message-----
From: pi song [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 11, 2008 7:15 AM
To: [email protected]
Subject: Re: Propsoal for handling "GC overhead limit" errors
Sorry. It's actually Long.MAX_VALUE, not Integer.
On Thu, Jun 12, 2008 at 12:12 AM, pi song <[EMAIL PROTECTED]> wrote:
> Pradeep,
>
> I totally buy your biggest_heap*0.7 idea.
>
> BUT!!, I've tried this:-
>
> for (int i = 0; i < 100000; i++) {
>     StringBuilder sb = new StringBuilder();
>     for (int j = 0; j < 100; j++) {
>         sb.append("hodgdfdsfsddf");
>     }
>     System.gc();
> }
> And it doesn't give me any error. So I think calling it too often is
> not a problem, except that it might be slow.
>
> GCActivationSize is set to Integer.MAX_VALUE by default. I believe
> most people have never used it, so it should have nothing to do with
> the current problem.
>
> My concern about using soft/weak references for data in a bag is that
> if the granularity is too fine, we will need more space for those
> additional pointers.
>
> Pi
>
>
> On Wed, Jun 11, 2008 at 5:51 AM, Mridul Muralidharan <
> [EMAIL PROTECTED]> wrote:
>
>>
>>
>> Ideally, instead of using SpillableMemoryManager, it might be better
>> to:
>>
>> a) use soft/weak references to refer to the data in a bag/tuple.
>> a.1) soft references, since they are less GC-sensitive than weak
>> references (a GC typically kicks all weak refs out). So soft refs are
>> sort of like a cache whose entries are not kicked out so frequently.
>> b) register them with a reference queue and manage the life cycle of
>> the referent (to spill/not spill).
>> c) override get/put in bag/tuple such that we load from disk if the
>> referent is null (this should already be done in some way in the
>> current code).
>>
>>
>> Of course, this is much more work and is slightly more tricky ... so
>> if SpillableMemoryManager can handle the requirements, it should work
>> fine.
>>
>>
>> Regards,
>> Mridul
>>
>>
>>
>> Pradeep Kamath wrote:
>>
>>> Hi,
>>>
>>>
>>> Currently in org.apache.pig.impl.util.SpillableMemoryManager:
>>>
>>>
>>> 1) We use the MemoryManagement interface to get notified when the
>>> "collection threshold" is exceeded (we set this to
>>> biggest_heap*0.5). With this in place we are still seeing "GC
>>> overhead limit" issues when trying large dataset operations.
>>> Observing some runs, it looks like the notification is not frequent
>>> enough or early enough to prevent memory issues, possibly because
>>> this notification only occurs after GC.
>>>
>>>
>>> 2) We only attempt to free up to:
>>>
>>> long toFree = info.getUsage().getUsed() -
>>>     (long)(info.getUsage().getMax()*.5);
>>>
>>> This is only the excess over the threshold that caused the
>>> notification, so freeing it is not enough to avoid being notified
>>> again soon.
>>>
>>>
>>> 3) While iterating over spillables, if the current spillable's
>>> memory size is > gcActivationSize, we try to invoke System.gc()
>>>
>>>
>>> 4) We *always* invoke System.gc() after iterating over spillables
>>>
>>>
>>> Proposed changes are:
>>>
>>> =================
>>>
>>> 1) In addition to the "collection threshold" of biggest_heap*0.5, a
>>> "usage threshold" of biggest_heap*0.7 will be used so we get notified
>>> early and often, irrespective of whether garbage collection has
>>> occurred.
>>>
>>>
>>> 2) We will attempt to free
>>>
>>> toFree = info.getUsage().getUsed() - threshold +
>>>     (long)(threshold * 0.5);
>>>
>>> where threshold is (info.getUsage().getMax() * 0.7) if the
>>> handleNotification() method is handling a "usage threshold exceeded"
>>> notification and (info.getUsage().getMax() * 0.5) otherwise (the
>>> "collection threshold exceeded" case).
>>>
>>>
>>> 3) While iterating over spillables, if the *memory freed thus far*
>>> is > gcActivationSize OR if we have freed sufficient memory (based on
>>> 2) above), then we set a flag to invoke System.gc() when we exit the
>>> loop.
>>>
>>> 4) We will invoke System.gc() only if the flag is set in 3) above
>>>
>>>
>>> Please provide thoughts/comments.
>>>
>>>
>>> Thanks,
>>>
>>> Pradeep
>>>
>>>
>>>
>>
>