[jira] Commented: (PIG-176) pig creates many small files when it spills

Pi Song (JIRA) Mon, 07 Apr 2008 06:41:44 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586385#action_12586385
 ]


Pi Song commented on PIG-176:
-----------------------------

Now that we spill big bags first, my observation is that there are still 
cases where a big container bag is spilled, so its mContent becomes empty, 
but most of its inner bags' WeakReferences haven't been cleaned up by the GC 
yet. In such cases, if we haven't freed up enough memory, those inner bags 
will be spilled unnecessarily (even though all their contents were already 
spilled as part of the big bag's spill). There are possibly 2 simple ways to 
solve this:

1) In SpillableMemoryManager, we try putting Thread.yield() between spills. 
This should give the GC some more time to clean up without degrading 
performance too much. However, if the main execution thread doesn't produce 
any bags (e.g. a map task where all keys and values are tuples and atomic 
data), yielding just gives the main execution thread more time to use up 
memory even more quickly.

2) Check the size of the spillable currently being spilled. If it is larger 
than a constant X, call System.gc(). This is safer than (1), but because we 
explicitly call GC more often, it may have some impact on performance. 
However, since spilling small files is much slower than calling System.gc(), 
this approach should generally give better performance overall.
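
As a rough illustration of (2), here is a minimal Java sketch. It is not 
Pig's actual SpillableMemoryManager code: SketchSpillable, SketchBag, 
SketchSpillManager and GC_THRESHOLD_BYTES are hypothetical names, and the 
threshold value is an arbitrary assumption. Note that System.gc() is only a 
hint to the JVM, so the inner bags' WeakReferences may or may not be cleared 
by the time the next spillable is examined.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a spillable bag; not Pig's real interface.
interface SketchSpillable {
    long getMemorySize();   // bytes currently held in memory
    long spill();           // releases in-memory contents, returns bytes released
}

// Trivial spillable used only to exercise the sketch.
class SketchBag implements SketchSpillable {
    private long inMemoryBytes;
    SketchBag(long bytes) { inMemoryBytes = bytes; }
    public long getMemorySize() { return inMemoryBytes; }
    public long spill() {
        long released = inMemoryBytes;
        inMemoryBytes = 0;  // a real bag would write to disk and clear its contents
        return released;
    }
}

class SketchSpillManager {
    // "Constant X" from approach (2); the value here is an arbitrary assumption.
    static final long GC_THRESHOLD_BYTES = 5L * 1024 * 1024;

    private final List<WeakReference<SketchSpillable>> spillables = new ArrayList<>();
    int explicitGcCalls = 0;  // counter exposed so the behaviour can be observed

    // Pairs a weak reference with a size snapshot so sorting holds no strong refs.
    private static final class Entry {
        final WeakReference<SketchSpillable> ref;
        final long size;
        Entry(WeakReference<SketchSpillable> ref, long size) {
            this.ref = ref;
            this.size = size;
        }
    }

    void register(SketchSpillable s) {
        spillables.add(new WeakReference<>(s));
    }

    long spillUntil(long bytesToFree) {
        // Forget spillables that the GC has already collected.
        spillables.removeIf(ref -> ref.get() == null);

        List<Entry> order = new ArrayList<>();
        for (WeakReference<SketchSpillable> ref : spillables) {
            SketchSpillable s = ref.get();
            if (s != null) {
                order.add(new Entry(ref, s.getMemorySize()));
            }
        }
        order.sort((a, b) -> Long.compare(b.size, a.size));  // biggest first

        long freed = 0;
        for (Entry e : order) {
            if (freed >= bytesToFree) {
                break;
            }
            SketchSpillable s = e.ref.get();
            if (s == null) {
                continue;  // cleared since the snapshot, e.g. by the GC request below
            }
            long sizeBefore = s.getMemorySize();
            freed += s.spill();
            if (sizeBefore > GC_THRESHOLD_BYTES) {
                // Approach (2): after a big spill, ask the GC to reclaim the inner
                // bags before they are considered for spilling themselves.
                // Approach (1) would call Thread.yield() after every spill instead.
                System.gc();
                explicitGcCalls++;
            }
        }
        return freed;
    }
}
```

The loop looks each spillable up through its WeakReference again right 
before spilling it, so anything the GC managed to clear after a big spill is 
skipped instead of being written out as another small file.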

I don't really have a processing task that spills that much. Can anyone 
please try (2) out?

> pig creates many small files when it spills
> -------------------------------------------
>
>                 Key: PIG-176
>                 URL: https://issues.apache.org/jira/browse/PIG-176
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Olga Natkovich
>            Assignee: Alan Gates
>
> Currently, on spill pig can generate millions of small (under 128K) files. 
> Partially this is due to PIG-170 but even with that patch, you can still try 
> and spill small bags.
> The proposal is to not spill small files. Alan told me that the logic is 
> already there but we just need to bump the size limit.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
