[
https://issues.apache.org/jira/browse/PIG-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Dreyfus updated PIG-3979:
-------------------------------
Attachment: SpillableMemoryManager.java.patch
POPartialAgg.java.patch
Updates to attachments:
SpillableMemoryManager - Made the sorting stable to avoid
java.lang.IllegalArgumentException: Comparison method violates its general
contract! (see JIRA-4012). Eliminated the calls to System.gc() from within
the GC notification handler, which seems to eliminate the GC death spiral.
Since gc() is only a hint, and since calling it would trigger another
notification, the hope is that removing it won't cause problems for use
cases where the spillable being called isn't POPartialAgg.
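The contract violation arises when a spillable's reported size changes while the sort is still running, so the comparator gives inconsistent answers. One way to make the comparison stable is to snapshot each size once before sorting. A minimal sketch under that assumption (the Spillable interface and all names here are illustrative stand-ins, not the actual patch):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class StableSpillSort {

    // Hypothetical stand-in for Pig's Spillable interface.
    interface Spillable {
        long getMemorySize();
    }

    // Pairs a spillable with its size at one instant.
    static final class Entry<T extends Spillable> {
        final T spillable;
        final long size;
        Entry(T s) { this.spillable = s; this.size = s.getMemorySize(); }
    }

    // Sorting the snapshots (largest first) guarantees the comparator
    // returns consistent answers for the whole sort even if the live
    // size estimates change concurrently -- the situation that makes
    // TimSort throw "Comparison method violates its general contract!".
    static <T extends Spillable> List<T> sortBySnapshotSize(List<T> spillables) {
        List<Entry<T>> entries = new ArrayList<>();
        for (T s : spillables) entries.add(new Entry<>(s));
        entries.sort(Comparator.comparingLong((Entry<T> e) -> e.size).reversed());
        List<T> out = new ArrayList<>();
        for (Entry<T> e : entries) out.add(e.spillable);
        return out;
    }
}
```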
POPartialAgg -
Made sure we have a default memory size.
Made the SpillableMemoryManager trigger an aggregation pass that may still
spill, but no longer always forces a spill.
In both files, lowered some log messages from INFO to DEBUG level.
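The aggregate-first behavior can be sketched as follows; the fields, starting numbers, and the way the reduction is checked against pig.exec.mapPartAgg.minReduction are illustrative assumptions, not POPartialAgg's actual code:

```java
class SpillCallbackSketch {
    // Illustrative numbers, not Pig's actual fields.
    private long bufferedBytes = 64L * 1024 * 1024;
    private final int minReduction = 3; // pig.exec.mapPartAgg.minReduction

    // Under memory pressure, run an in-memory aggregation pass first and
    // only fall through to a disk spill when the pass reclaimed too little.
    // Returns the bytes written to disk (0 when aggregation sufficed).
    long onMemoryPressure(int observedReduction) {
        bufferedBytes /= Math.max(1, observedReduction); // aggregation pass
        if (observedReduction >= minReduction) {
            return 0; // enough reclaimed in memory; don't force a spill
        }
        long spilled = bufferedBytes; // aggregation didn't help enough
        bufferedBytes = 0;
        return spilled;
    }
}
```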
> group all performance, garbage collection, and incremental aggregation
> ----------------------------------------------------------------------
>
> Key: PIG-3979
> URL: https://issues.apache.org/jira/browse/PIG-3979
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.12.0, 0.11.1
> Reporter: David Dreyfus
> Assignee: David Dreyfus
> Fix For: 0.14.0
>
> Attachments: PIG-3979-v1.patch, POPartialAgg.java.patch,
> SpillableMemoryManager.java.patch
>
>
> I have a PIG statement similar to:
> summary = foreach (group data ALL) generate
>     COUNT(data.col1), SUM(data.col2), SUM(data.col3)
>     , Moments(data.col3)
>     , Moments(data.col4);
> There are a couple of hundred columns.
> I set the following:
> SET pig.exec.mapPartAgg true;
> SET pig.exec.mapPartAgg.minReduction 3;
> SET pig.cachedbag.memusage 0.05;
> I found that when I ran this on a JVM with insufficient memory, the process
> eventually timed out because of an infinite garbage collection loop.
> The problem was invariant to the memusage setting.
> I solved the problem by making changes to:
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.java
> Rather than reading in 10000 records to establish an estimate of the
> reduction, I make the estimate after reading in enough tuples to fill the
> pig.cachedbag.memusage fraction of Runtime.getRuntime().maxMemory().
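The sampling cutoff described above amounts to a byte threshold derived from the heap size rather than a fixed record count. A sketch with illustrative names (not Pig's actual fields):

```java
class SampleThreshold {
    // Bytes of tuple data to buffer before estimating the map-side
    // reduction, instead of a fixed 10000-record sample.
    // memusage is the value of pig.cachedbag.memusage, e.g. 0.05.
    static long sampleBytes(double memusage) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        return (long) (maxHeap * memusage);
    }

    // Accumulate tuples until their estimated in-memory size crosses the
    // threshold, then compute reduction = tuples in / tuples out.
    static boolean doneSampling(long bufferedBytes, double memusage) {
        return bufferedBytes >= sampleBytes(memusage);
    }
}
```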
> I also made a change to guarantee that at least one record is allowed in
> second-tier storage. In the current implementation, if the reduction is
> very high (e.g. 1000:1), the space in second-tier storage is zero.
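Guaranteeing that slot is a simple clamp: with a high reduction such as 1000:1 and a modest first tier, a proportional capacity computation truncates to zero. Illustrative arithmetic only, not the patch itself:

```java
class SecondTierSizing {
    // Capacity of second-tier (already-aggregated) storage derived from
    // the first-tier capacity and the observed reduction. With a 1000:1
    // reduction and a small first tier the quotient is 0, so clamp it
    // to at least one record.
    static int secondTierCapacity(int firstTierRecords, int reductionFactor) {
        return Math.max(1, firstTierRecords / reductionFactor);
    }
}
```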
> With these changes, I can summarize large data sets with small JVMs. I also
> find that setting pig.cachedbag.memusage to a small number such as 0.05
> results in much better garbage collection performance without reducing
> throughput. I suppose tuning the GC would also address the excessive
> garbage collection.
> The performance is sweet.
--
This message was sent by Atlassian JIRA
(v6.2#6252)