Ideally, instead of using SpillableMemoryManager, it might be better to:
a) use soft/weak references to refer to the data in a bag/tuple.
a.1) prefer soft references, since they are less GC-sensitive than weak
references (a GC typically clears all weak refs). Soft refs are
sort of like a cache whose entries are evicted less frequently.
b) register them with a reference queue and manage the life cycle of
the referent (to spill/not spill).
c) override get/put in bag/tuple such that we load from disk if the
referent is null (this should already be done in some way in the code
currently).
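
A rough sketch of what a) to c) could look like, assuming a hypothetical
SoftBackedBag class (not an existing Pig class) that holds a list of
strings. One caveat the sketch makes explicit: by the time a cleared
reference shows up on the queue the referent is already gone, so the data
here is written to disk eagerly and the queue is only used to learn which
bags were dropped. The simulateClear() method is a test hook that stands
in for the collector.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration only; it is not part of Pig.
class SoftBackedBag {
    private File spillFile;
    private final ReferenceQueue<ArrayList<String>> queue;
    private SoftReference<ArrayList<String>> ref;

    SoftBackedBag(ArrayList<String> data, ReferenceQueue<ArrayList<String>> queue) {
        this.queue = queue;
        try {
            // Assumption: spill eagerly, because a soft referent cannot be
            // recovered once the collector has cleared it.
            spillFile = File.createTempFile("bag", ".spill");
            spillFile.deleteOnExit();
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(spillFile))) {
                out.writeObject(data);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // b) register the reference with the shared queue
        ref = new SoftReference<>(data, queue);
    }

    // c) load from disk if the referent is null
    @SuppressWarnings("unchecked")
    List<String> get() {
        ArrayList<String> data = ref.get();
        if (data == null) {
            try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream(spillFile))) {
                data = (ArrayList<String>) in.readObject();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (ClassNotFoundException e) {
                throw new IllegalStateException(e);
            }
            ref = new SoftReference<>(data, queue);
        }
        return data;
    }

    // Test hook: mimics the collector clearing and enqueueing the reference.
    void simulateClear() {
        ref.clear();
        ref.enqueue();
    }
}
```

A manager thread could poll() the queue to track which bags currently live
only on disk; whether that bookkeeping is worth the extra complexity is
exactly the trade-off mentioned in this mail.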
Of course, this is much more work and is slightly more tricky ... so if
SpillableMemoryManager can handle the requirements, it should work fine.
Regards,
Mridul
Pradeep Kamath wrote:
Hi,
Currently in org.apache.pig.impl.util.SpillableMemoryManager:
1) We use the MemoryManagement interface to get notified when memory
usage exceeds the "collection threshold" (we set this to
biggest_heap*0.5). With this in place we are still seeing "GC overhead
limit" issues when running operations on large datasets. Observing some
runs, it looks like the notification is neither frequent enough nor early
enough to prevent memory issues, possibly because this notification only
occurs after a GC.
2) We only attempt to free up to:
long toFree = info.getUsage().getUsed() -
(long)(info.getUsage().getMax()*.5);
This is only the excess over the threshold that caused the
notification, so freeing it is not enough to avoid being notified again
soon.
3) While iterating over spillables, if the current spillable's memory size
is > gcActivationSize, we try to invoke System.gc()
4) We *always* invoke System.gc() after iterating over spillables
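
To make point 2) concrete, here is the current calculation pulled out
into a hypothetical helper (the class name and the plain long parameters
are assumptions for illustration; the real code reads the values from
info.getUsage()).

```java
// Hypothetical helper mirroring the current toFree calculation.
class CurrentSpillTarget {
    // used and max are in the same unit (bytes in the real code).
    static long toFree(long used, long max) {
        // Only the excess over the 0.5 collection threshold is targeted.
        return used - (long) (max * .5);
    }
}
```

With made-up numbers, max = 1000 and used = 520 give a target of 20, so
after a successful spill usage sits right at the threshold and the next
allocation burst can trigger another notification.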
Proposed changes are:
=================
1) In addition to the "collection threshold" of biggest_heap*0.5, a "usage
threshold" of biggest_heap*0.7 will be used so we get notified early and
often, irrespective of whether garbage collection has occurred.
2) We will attempt to free
toFree = info.getUsage().getUsed() - threshold + (long)(threshold *
0.5); where threshold is (info.getUsage().getMax() * 0.7) if the
handleNotification() method is handling a "usage threshold exceeded"
notification and (info.getUsage().getMax() * 0.5) otherwise ("collection
threshold exceeded" case)
3) While iterating over spillables, if the *memory freed thus far* is >
gcActivationSize OR if we have freed sufficient memory (based on 2)
above), then we set a flag to invoke System.gc when we exit the loop.
4) We will invoke System.gc() only if the flag is set in 3) above
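
A minimal sketch of proposed changes 2) to 4), not the actual patch. The
SpillableSketch interface, FixedSizeSpillable and ProposedSpillSketch are
hypothetical names; only the MemoryNotificationInfo constant comes from
java.lang.management. The sketch assumes the loop keeps spilling after
gcActivationSize is crossed and stops once enough memory has been freed,
which is one possible reading of 3).

```java
import java.lang.management.MemoryNotificationInfo;
import java.util.List;

// Hypothetical stand-in for a spillable bag.
interface SpillableSketch {
    long spill(); // returns the amount of memory freed
}

// Hypothetical spillable that frees a fixed amount, for the usage example.
class FixedSizeSpillable implements SpillableSketch {
    final long size;
    boolean spilled = false;

    FixedSizeSpillable(long size) {
        this.size = size;
    }

    public long spill() {
        spilled = true;
        return size;
    }
}

class ProposedSpillSketch {
    // 2) Free the excess over the threshold plus half the threshold.
    static long toFree(String notificationType, long used, long max) {
        boolean usageExceeded =
            MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(notificationType);
        long threshold = (long) (max * (usageExceeded ? 0.7 : 0.5));
        return used - threshold + (long) (threshold * 0.5);
    }

    // 3) Spill until enough is freed; the return value is the flag that
    // tells the caller whether to invoke System.gc() (change 4).
    static boolean spillUntil(List<? extends SpillableSketch> spillables,
                              long toFree, long gcActivationSize) {
        long freed = 0;
        boolean invokeGc = false;
        for (SpillableSketch s : spillables) {
            freed += s.spill();
            if (freed > gcActivationSize) {
                invokeGc = true;
            }
            if (freed >= toFree) {
                invokeGc = true;
                break;
            }
        }
        return invokeGc;
    }
}
```

The caller would then do something like
if (ProposedSpillSketch.spillUntil(list, target, gcActivationSize)) System.gc();
With made-up numbers max = 1000 and used = 720, a "usage threshold
exceeded" notification gives a target of 720 - 700 + 350 = 370, compared
with 20 under the current formula.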
Please provide thoughts/comments.
Thanks,
Pradeep