[
https://issues.apache.org/jira/browse/PIG-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168538#comment-14168538
]
Daniel Dai commented on PIG-3979:
---------------------------------
I don't have a test case. In theory, this could happen under memory
stress: the actual spill code uses more memory, so it could trigger constant GC.
But I agree to keep the code simple for now. We need more research to figure
out whether we can stop additional GC within the spill, and there is additional
complexity in making POPartialAgg multi-thread safe.
There are only 2 minor comments left in RB. Can you take a look?
> group all performance, garbage collection, and incremental aggregation
> ----------------------------------------------------------------------
>
> Key: PIG-3979
> URL: https://issues.apache.org/jira/browse/PIG-3979
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.12.0, 0.11.1
> Reporter: David Dreyfus
> Assignee: David Dreyfus
> Fix For: 0.14.0
>
> Attachments: PIG-3979-3.patch, PIG-3979-v1.patch,
> POPartialAgg.java.patch, SpillableMemoryManager.java.patch
>
>
> I have a PIG statement similar to:
> summary = foreach (group data ALL) generate
> COUNT(data.col1), SUM(data.col2), SUM(data.col2)
> , Moments(col3)
> , Moments(data.col4)
> There are a couple of hundred columns.
> I set the following:
> SET pig.exec.mapPartAgg true;
> SET pig.exec.mapPartAgg.minReduction 3;
> SET pig.cachedbag.memusage 0.05;
> I found that when I ran this on a JVM with insufficient memory, the process
> eventually timed out because of an infinite garbage collection loop.
> The problem was invariant to the memusage setting.
> I solved the problem by making changes to:
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperator.POPartialAgg.java
> Rather than reading in 10000 records to establish an estimate of the
> reduction, I make an estimate after reading in enough tuples to fill the
> pig.cachedbag.memusage fraction of Runtime.getRuntime().maxMemory().
> I also made a change to guarantee that at least one record is allowed in
> second tier storage. In the current implementation, if the reduction is very
> high (e.g. 1000:1), space in second tier storage is zero.
> With these changes, I can summarize large data sets with small JVMs. I also
> find that setting pig.cachedbag.memusage to a small number such as 0.05
> results in much better garbage collection performance without reducing
> throughput. I suppose tuning GC would also solve a problem with excessive
> garbage collection.
> The performance is sweet.
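
For readers without the patches at hand, the two changes described in the
issue can be illustrated with a minimal Java sketch. It is not the code from
the attached patches and not the actual POPartialAgg implementation. The class
name, the method names, and the formula "second tier capacity = first tier
capacity / reduction" are assumptions made only for illustration. The sketch
shows a sample budget derived from a share of the maximum heap, and a second
tier capacity that is clamped so it never drops to zero.

```java
// Hypothetical sketch; names and formulas are assumptions, not Pig's API.
final class SpillSizingSketch {
    private SpillSizingSketch() {}

    // Bytes that may be filled with tuples before the reduction is estimated.
    // memUsageFraction plays the role of pig.cachedbag.memusage (e.g. 0.05).
    static long sampleBudgetBytes(long maxHeapBytes, double memUsageFraction) {
        if (maxHeapBytes < 0 || memUsageFraction <= 0.0 || memUsageFraction > 1.0) {
            throw new IllegalArgumentException("invalid heap size or fraction");
        }
        return (long) (maxHeapBytes * memUsageFraction);
    }

    // Reduction observed on the sample: input records per distinct key.
    static double observedReduction(long recordsRead, long distinctKeys) {
        if (recordsRead <= 0 || distinctKeys <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        return (double) recordsRead / (double) distinctKeys;
    }

    // Assumed sizing rule: scale the first tier capacity down by the reduction.
    // Integer truncation can yield 0 when the reduction is very high, so the
    // result is clamped to at least one record.
    static int secondTierCapacity(int firstTierCapacity, double reduction) {
        if (firstTierCapacity <= 0 || reduction <= 0.0) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        int scaled = (int) (firstTierCapacity / reduction);
        return Math.max(1, scaled);
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        long budget = sampleBudgetBytes(maxHeap, 0.05);
        System.out.println("max heap bytes:      " + maxHeap);
        System.out.println("sample budget bytes: " + budget);
        // A 1000:1 reduction with a small first tier would truncate to 0
        // without the clamp.
        System.out.println("second tier slots:   " + secondTierCapacity(500, 1000.0));
    }
}
```

Tying the sample size to the heap rather than to a fixed record count means a
small JVM estimates the reduction earlier, before it has buffered more tuples
than it can hold.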