[
https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586385#action_12586385
]
Pi Song commented on PIG-176:
-----------------------------
Now that we spill big bags first, I have observed that there are still cases
where a big container bag is spilled, so its mContent becomes empty, but most
of its inner bags' WeakReferences have not yet been cleaned up by the GC. In
such cases, if we haven't freed enough memory, those inner bags will be spilled
unnecessarily (even though all their contents were already spilled as part of
the big bag's spill). There are possibly two simple ways to solve this:
1) In SpillableMemoryManager, we try putting a Thread.yield() between
consecutive spills. This should give the GC some more time to clean up without
degrading performance too much. However, if the main execution thread doesn't
produce any bags (e.g. a map task where all keys and values are tuples and
atomic data), this will instead give the main execution thread more time to
use up memory even more quickly.
2) Check the size of the spillable currently being spilled. If it is larger
than some constant X, call System.gc(). This is safer than (1), but because we
explicitly invoke GC more often, it may have some impact on performance.
However, since spilling small files is much slower than calling System.gc(),
this approach should generally give better performance overall.
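To make (2) concrete, here is a rough sketch of the idea, not the actual
SpillableMemoryManager code. The Spillable interface below is a hypothetical
stand-in declared only for this example, and the class name SpillSketch, the
methods spillUntil and shouldGcAfterSpill, and the 5 MB value chosen for
"constant X" are all assumptions made for illustration. Note that System.gc()
is only a hint to the JVM, so the clean-up is not guaranteed.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

class SpillSketch {
    /** Hypothetical stand-in for a spillable bag; declared only for this sketch. */
    interface Spillable {
        long getMemorySize();
        long spill(); // returns the number of bytes freed
    }

    // "Constant X" from option (2). The 5 MB value is an arbitrary assumption.
    static final long GC_THRESHOLD_BYTES = 5L * 1024 * 1024;

    /** Pairs a weak reference with a size snapshot so sorting holds no strong refs. */
    private static final class Candidate {
        final WeakReference<Spillable> ref;
        final long size;

        Candidate(WeakReference<Spillable> ref, long size) {
            this.ref = ref;
            this.size = size;
        }
    }

    static boolean shouldGcAfterSpill(long spilledSize) {
        return spilledSize > GC_THRESHOLD_BYTES;
    }

    /** Spills the largest bags first until roughly bytesToFree has been freed. */
    static long spillUntil(List<WeakReference<Spillable>> refs, long bytesToFree) {
        List<Candidate> candidates = new ArrayList<>();
        for (WeakReference<Spillable> ref : refs) {
            Spillable s = ref.get();
            if (s != null) {
                candidates.add(new Candidate(ref, s.getMemorySize()));
            }
        }
        // Biggest first, using the size recorded above.
        candidates.sort((a, b) -> Long.compare(b.size, a.size));

        long freed = 0;
        for (Candidate c : candidates) {
            if (freed >= bytesToFree) {
                break;
            }
            // Re-resolve the weak reference at spill time: an inner bag whose
            // reference was cleared by an earlier GC is skipped, not re-spilled.
            Spillable s = c.ref.get();
            if (s == null) {
                continue;
            }
            long sizeBefore = s.getMemorySize();
            freed += s.spill();
            if (shouldGcAfterSpill(sizeBefore)) {
                // After a large spill, ask the JVM to collect so that the weak
                // references of now-unreachable inner bags can be cleared.
                System.gc();
            }
        }
        return freed;
    }
}
```

The sketch sorts on a size snapshot and dereferences each WeakReference only
when it is about to spill, so the loop itself does not keep inner bags strongly
reachable. Option (1) would replace the System.gc() call with an unconditional
Thread.yield() after each spill.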
I don't really have a processing task that spills that much. Can anyone please
try (2) out?
> pig creates many small files when it spills
> -------------------------------------------
>
> Key: PIG-176
> URL: https://issues.apache.org/jira/browse/PIG-176
> Project: Pig
> Issue Type: Bug
> Reporter: Olga Natkovich
> Assignee: Alan Gates
>
> Currently, on spill pig can generate millions of small (under 128K) files.
> Partially this is due to PIG-170 but even with that patch, you can still try
> and spill small bags.
> The proposal is to not spill small files. Alan told me that the logic is
> already there but we just need to bump the size limit.