[ https://issues.apache.org/jira/browse/PIG-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177063#comment-14177063 ]

Rohini Palaniswamy commented on PIG-3979:
-----------------------------------------

[~ddreyfus],

bq. For counting spilled bytes, I rely on the counters.
   The problem is that for killed or failed tasks the counters are discarded, 
and the log messages are very useful in that case.
 
bq. I understand how the extra GC solved PIG-3148. 
   There were two System.gc() calls in the SpillableMemoryManager. A 
System.gc() has always been invoked after all the spill() calls; that was 
there from the beginning. Then there is an extra GC that was introduced to do 
a System.gc() before the spill() call when the object size exceeds a certain 
threshold. You have removed both System.gc() calls, which I believe will lead 
to a lot of regressions. It would be nice to add another API to confirm before 
invoking the extra GC, but that would break backward compatibility unless we 
move to Java 8. For now we can hack around it by doing instanceof POPartialAgg 
and not invoking the extra GC; a sketch of that hack follows. I am posting a 
patch that makes the spill in POPartialAgg synchronous. Can you test whether 
it fixes your problem? 
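
For illustration, a minimal Java sketch of the instanceof hack described 
above. The spillAll method, the extraGCThreshold parameter, and the 
surrounding class are hypothetical stand-ins; only Spillable, POPartialAgg, 
and the two System.gc() calls come from the discussion:

{code:java}
import java.util.List;

import org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg;
import org.apache.pig.impl.util.Spillable;

public class SpillSketch {
    // Hypothetical stand-in for SpillableMemoryManager's spill loop.
    static void spillAll(List<Spillable> spillables, long extraGCThreshold) {
        for (Spillable s : spillables) {
            // The proposed hack: POPartialAgg manages its own aggregation
            // buffers, so skip the extra GC that normally precedes a large
            // spill and keep it for all other spillables.
            if (!(s instanceof POPartialAgg)
                    && s.getMemorySize() > extraGCThreshold) {
                System.gc(); // the extra GC introduced for PIG-3148
            }
            s.spill();
        }
        // The System.gc() that has always followed the spill() calls.
        System.gc();
    }
}
{code}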

> group all performance, garbage collection, and incremental aggregation
> ----------------------------------------------------------------------
>
>                 Key: PIG-3979
>                 URL: https://issues.apache.org/jira/browse/PIG-3979
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.12.0, 0.11.1
>            Reporter: David Dreyfus
>            Assignee: David Dreyfus
>             Fix For: 0.14.0
>
>         Attachments: PIG-3979-3.patch, PIG-3979-4.patch, PIG-3979-v1.patch, 
> POPartialAgg.java.patch, SpillableMemoryManager.java.patch
>
>
> I have a Pig statement similar to:
> summary = foreach (group data ALL) generate
>     COUNT(data.col1), SUM(data.col2),
>     Moments(data.col3), Moments(data.col4);
> There are a couple of hundred columns.
> I set the following:
> SET pig.exec.mapPartAgg true;
> SET pig.exec.mapPartAgg.minReduction 3;
> SET pig.cachedbag.memusage 0.05;
> I found that when I ran this on a JVM with insufficient memory, the process 
> eventually timed out because of an infinite garbage-collection loop.
> The problem persisted regardless of the memusage setting.
> I solved the problem by making changes to:
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg
> Rather than reading in 10000 records to establish an estimate of the 
> reduction, I make the estimate after reading in enough tuples to fill 
> pig.cachedbag.memusage percent of Runtime.getRuntime().maxMemory().
> I also made a change to guarantee that at least one record is allowed in 
> second-tier storage. In the current implementation, if the reduction is very 
> high (e.g. 1000:1), the space in second-tier storage is zero. A sketch of 
> both changes follows this description.
> With these changes, I can summarize large data sets with small JVMs. I also 
> find that setting pig.cachedbag.memusage to a small number such as 0.05 
> results in much better garbage-collection performance without reducing 
> throughput. I suppose tuning the GC would also address the excessive 
> garbage collection.
> The performance is sweet. 
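
For illustration, a hedged Java sketch of the two changes described in the 
issue text above. The class and its abstract helpers (nextInputTuple, 
bufferTuple, estimateSecondTierCapacity) are hypothetical, not actual 
POPartialAgg methods; only Tuple.getMemorySize() and 
Runtime.getRuntime().maxMemory() come from the description:

{code:java}
import org.apache.pig.data.Tuple;

// Illustrative sketch only; the abstract helpers are hypothetical.
public abstract class SamplingSketch {
    abstract Tuple nextInputTuple() throws Exception; // pull next input tuple
    abstract void bufferTuple(Tuple t);               // first-tier storage
    abstract int estimateSecondTierCapacity();        // from observed reduction

    void estimateReductionBySize() throws Exception {
        long maxHeap = Runtime.getRuntime().maxMemory();
        float memusage = 0.05f;                 // pig.cachedbag.memusage
        long sampleBudget = (long) (maxHeap * memusage);

        long sampledBytes = 0;
        Tuple t;
        // Sample until the buffered tuples fill memusage percent of the max
        // heap, rather than stopping at a fixed 10000-record sample.
        while (sampledBytes < sampleBudget && (t = nextInputTuple()) != null) {
            sampledBytes += t.getMemorySize();  // estimated in-memory size
            bufferTuple(t);
        }
        // Guarantee at least one record of second-tier capacity even when
        // the estimated reduction is extreme (e.g. 1000:1) and the old
        // sizing logic would round down to zero.
        int secondTierCapacity = Math.max(1, estimateSecondTierCapacity());
    }
}
{code}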


