Sorry. It's actually Long.MAX_VALUE, not Integer.
On Thu, Jun 12, 2008 at 12:12 AM, pi song <[EMAIL PROTECTED]> wrote:
> Pradeep,
>
> I totally buy your biggest_heap*0.7 idea.
>
> BUT!!, I've tried this:-
>
> for (int i = 0; i < 100000; i++) {
>     StringBuilder sb = new StringBuilder();
>     for (int j = 0; j < 100; j++) {
>         sb.append("hodgdfdsfsddf");
>     }
>     System.gc();
> }
> And it doesn't give me any error. So I think calling System.gc() too often
> is not a problem, except that it might be slow.
>
> gcActivationSize is set to Integer.MAX_VALUE by default. I believe most
> people have never used it. So, it should have nothing to do with the current
> problem.
>
> My concern about using soft/weak references for the data in a bag is that if
> the granularity is too fine, we will need more space for those additional
> pointers.
>
> Pi
>
>
> On Wed, Jun 11, 2008 at 5:51 AM, Mridul Muralidharan <
> [EMAIL PROTECTED]> wrote:
>
>>
>>
>> Ideally, instead of using SpillableMemoryManager, it might be better to -
>>
>> a) use soft/weak references to refer to the data in a bag/tuple.
>> a.1) prefer soft references, since they are less gc sensitive than weak
>> references (a gc typically kicks all weak refs out). So soft refs are sort
>> of like a cache whose entries are not kicked out so frequently.
>> b) register them with a reference queue and manage the life cycle of the
>> referent (to spill/not spill).
>> c) override get/put in bag/tuple such that we load off the disk if the
>> referent is null (this should already be done in some way in the code
>> currently).
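
A minimal sketch of points a) to c) above, under some loud assumptions:
SoftBag, drainCleared() and simulateClear() are hypothetical names, not Pig
classes or methods, and the data is written to disk up front so that a
cleared referent can always be reloaded. Elements are assumed to contain no
newlines. simulateClear() only exists to imitate what the collector would do
under memory pressure.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical disk-backed bag; illustrative only, not Pig's API. */
final class SoftBag {
    private final ReferenceQueue<List<String>> cleared = new ReferenceQueue<>();
    private final Path spillFile;
    private SoftReference<List<String>> cache;
    private int reloads = 0;

    SoftBag(List<String> tuples) {
        try {
            // Assumption: persist eagerly, so losing the referent loses no data.
            spillFile = Files.createTempFile("softbag", ".txt");
            spillFile.toFile().deleteOnExit();
            Files.write(spillFile, tuples);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // (a) + (b): soft reference registered with a reference queue.
        cache = new SoftReference<>(new ArrayList<>(tuples), cleared);
    }

    /** (c): load off the disk if the referent is null. */
    List<String> get() {
        List<String> data = cache.get();
        if (data == null) {
            try {
                data = new ArrayList<>(Files.readAllLines(spillFile));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            cache = new SoftReference<>(data, cleared);
            reloads++;
        }
        return data;
    }

    /** Polls the queue; returns true if any reference had been cleared. */
    boolean drainCleared() {
        boolean any = false;
        while (cleared.poll() != null) {
            any = true;
        }
        return any;
    }

    /** Test hook: imitates the collector clearing and enqueuing the reference. */
    void simulateClear() {
        cache.clear();
        cache.enqueue();
    }

    int reloads() {
        return reloads;
    }
}

final class SoftBagDemo {
    public static void main(String[] args) {
        SoftBag bag = new SoftBag(Arrays.asList("t1", "t2", "t3"));
        bag.simulateClear();
        System.out.println("cleared=" + bag.drainCleared()
                + " size=" + bag.get().size() + " reloads=" + bag.reloads());
    }
}
```

This uses one reference per bag rather than one per tuple, which keeps the
extra pointers few. The JVM guarantees that softly reachable objects are
cleared before it throws OutOfMemoryError, but when exactly that happens is
up to the collector.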
>>
>>
>> Of course, this is much more work and is slightly more tricky ... so if
>> SpillableMemoryManager can handle the requirements, it should work fine.
>>
>>
>> Regards,
>> Mridul
>>
>>
>>
>> Pradeep Kamath wrote:
>>
>>> Hi,
>>>
>>>
>>> Currently in org.apache.pig.impl.util.SpillableMemoryManager:
>>>
>>>
>>> 1) We use MemoryManagement interface to get notified when the
>>> "collection threshold" exceeds a limit (we set this to
>>> biggest_heap*0.5). With this in place we are still seeing "GC overhead
>>> limit" issues when trying large dataset operations. Observing some runs,
>>> it looks like the notification is not frequent enough and early enough
>>> to prevent memory issues possibly because this notification only occurs
>>> after GC.
>>>
>>>
>>> 2) We only attempt to free up to:
>>>
>>> long toFree = info.getUsage().getUsed() -
>>> (long)(info.getUsage().getMax()*.5);
>>>
>>> This is only the excess amount over the threshold which caused the
>>> notification, and is not enough to keep us from being notified again soon.
>>>
>>>
>>> 3) While iterating over spillables, if the current spillable's memory size
>>> is > gcActivationSize, we try to invoke System.gc()
>>>
>>>
>>> 4) We *always* invoke System.gc() after iterating over spillables
>>>
>>>
>>> Proposed changes are:
>>>
>>> =================
>>>
>>> 1) In addition to "collection threshold" of biggest_heap*0.5, a "usage
>>> threshold" of biggest_heap*0.7 will be used so we get notified early and
>>> often irrespective of whether garbage collection has occurred.
>>>
>>>
>>> 2) We will attempt to free
>>> toFree = info.getUsage().getUsed() - threshold + (long)(threshold *
>>> 0.5); where threshold is (info.getUsage().getMax() * 0.7) if the
>>> handleNotification() method is handling a "usage threshold exceeded"
>>> notification and (info.getUsage().getMax() * 0.5) otherwise ("collection
>>> threshold exceeded" case)
>>>
>>>
>>> 3) While iterating over spillables, if the *memory freed thus far* is >
>>> gcActivationSize OR if we have freed sufficient memory (based on 2)
>>> above), then we set a flag to invoke System.gc() when we exit the loop.
>>>
>>> 4) We will invoke System.gc() only if the flag is set in 3) above
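
The proposed items 2) to 4) above could be sketched roughly as follows. The
Spillable interface here is a hypothetical stand-in whose spill() returns the
bytes released, ProposedSpillSketch and spillUntil() are invented names, and
stopping the loop once enough has been freed is my assumption; the proposal
only says when the flag gets set.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the proposed changes; not the actual Pig code. */
final class ProposedSpillSketch {

    /** Stand-in for a spillable bag; spill() returns the bytes it released. */
    interface Spillable {
        long spill();
    }

    /** Item 2): the excess over the threshold plus half of the threshold. */
    static long toFree(long used, long max, boolean usageThresholdExceeded) {
        long threshold = (long) (max * (usageThresholdExceeded ? 0.7 : 0.5));
        return used - threshold + (long) (threshold * 0.5);
    }

    /** Item 3): returns the flag that says whether System.gc() should run. */
    static boolean spillUntil(List<Spillable> spillables, long toFree,
                              long gcActivationSize) {
        long freed = 0;
        boolean invokeGc = false;
        for (Spillable s : spillables) {
            freed += s.spill();
            if (freed > gcActivationSize || freed >= toFree) {
                invokeGc = true;
            }
            if (freed >= toFree) {
                break; // assumption: stop once enough has been freed
            }
        }
        return invokeGc;
    }

    public static void main(String[] args) {
        long toFree = toFree(800, 1000, true);
        List<Spillable> bags = Arrays.asList(() -> 200L, () -> 300L, () -> 400L);
        // Item 4): invoke System.gc() only if the flag was set in the loop.
        if (spillUntil(bags, toFree, Long.MAX_VALUE)) {
            System.gc();
        }
        System.out.println("toFree=" + toFree);
    }
}
```

With the same made-up numbers (used = 800, max = 1000), a usage threshold
notification gives 800 - 700 + 350 = 450 and a collection threshold
notification gives 800 - 500 + 250 = 550, so either way usage ends up well
below the threshold that fired.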
>>>
>>>
>>> Please provide thoughts/comments.
>>>
>>>
>>> Thanks,
>>>
>>> Pradeep
>>>
>>>
>>>
>>
>