[ 
https://issues.apache.org/jira/browse/PIG-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168449#comment-14168449
 ] 

David Dreyfus commented on PIG-3979:
------------------------------------

Regarding Philip (flip) Kromer's death spiral, do you have a test case that 
shows the problem isn't resolved? It was certainly my hope that I had solved 
it. My own testing suggested it was fixed (although I didn't have a test case 
that made it easy to confirm). Before I made the changes, tight memory 
parameters would cause a death spiral with ease. Afterward, not so much.

My sense is that trying to do useful work within the GC notification 
thread/handler can cause problems. By using the handler to set a flag and 
letting the main thread do the useful work, we avoid that problem. I haven't 
researched how GC notification works, and I'm not convinced that doing 
processing within the handler guarantees no further GC calls. Moreover, it's 
not clear that the POPartialAgg class is designed for access by multiple 
threads.

In short, I'd keep it as simple as possible, with all spill work handled by 
one spiller that operates on the main thread.
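A minimal sketch of that division of labor follows. The class and method names are hypothetical (this is not the actual POPartialAgg or SpillableMemoryManager code): the GC notification handler only flips a flag, and the operator's main processing loop checks the flag and performs the spill itself.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicBoolean;

import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

// Hypothetical sketch: keep the GC notification handler trivial and do all
// spill work on the main thread, as described above.
public class SpillSignal {
    // Set by the notification thread, read and cleared by the main thread.
    private static final AtomicBoolean spillRequested = new AtomicBoolean(false);

    // Called by the notification handler; does no real work.
    public static void requestSpill() {
        spillRequested.set(true);
    }

    public static void installListener() {
        NotificationListener listener =
            (notification, handback) -> requestSpill();
        for (GarbageCollectorMXBean bean :
                ManagementFactory.getGarbageCollectorMXBeans()) {
            // HotSpot's GC beans implement NotificationEmitter.
            ((NotificationEmitter) bean)
                .addNotificationListener(listener, null, null);
        }
    }

    // Polled from the main processing loop (e.g. once per input tuple);
    // the caller spills when this returns true.
    public static boolean checkAndClearSpillRequest() {
        return spillRequested.getAndSet(false);
    }
}
```

Because the handler touches nothing but the flag, it cannot allocate, contend for the operator's internal state, or trigger further collections from inside the notification callback.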



> group all performance, garbage collection, and incremental aggregation
> ----------------------------------------------------------------------
>
>                 Key: PIG-3979
>                 URL: https://issues.apache.org/jira/browse/PIG-3979
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.12.0, 0.11.1
>            Reporter: David Dreyfus
>            Assignee: David Dreyfus
>             Fix For: 0.14.0
>
>         Attachments: PIG-3979-3.patch, PIG-3979-v1.patch, 
> POPartialAgg.java.patch, SpillableMemoryManager.java.patch
>
>
> I have a PIG statement similar to:
> summary = foreach (group data ALL) generate
>     COUNT(data.col1), SUM(data.col2), SUM(data.col2),
>     Moments(col3),
>     Moments(data.col4);
> There are a couple of hundred columns.
> I set the following:
> SET pig.exec.mapPartAgg true;
> SET pig.exec.mapPartAgg.minReduction 3;
> SET pig.cachedbag.memusage 0.05;
> I found that when I ran this on a JVM with insufficient memory, the process 
> eventually timed out because of an infinite garbage collection loop.
> The problem was invariant to the memusage setting.
> I solved the problem by making changes to:
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.java
> Rather than reading in 10000 records to establish an estimate of the 
> reduction, I make an estimate after reading in enough tuples to fill 
> pig.cachedbag.memusage percent of Runtime.getRuntime().maxMemory().
> I also made a change to guarantee that at least one record is allowed in 
> second-tier storage. In the current implementation, if the reduction is very 
> high (e.g. 1000:1), the space in second-tier storage is zero.
> With these changes, I can summarize large data sets with small JVMs. I also 
> find that setting pig.cachedbag.memusage to a small number such as 0.05 
> results in much better garbage collection performance without reducing 
> throughput. I suppose tuning GC would also solve a problem with excessive 
> garbage collection.
> The performance is sweet. 
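The two sizing changes described in the issue can be sketched as follows. The class and method names are illustrative assumptions, not the actual POPartialAgg code: the reduction estimate is triggered once the buffered data reaches the pig.cachedbag.memusage fraction of the max heap (rather than after a fixed 10000-record sample), and the second-tier buffer is always given at least one slot.

```java
// Hypothetical sketch of the sizing logic described in PIG-3979;
// names and signatures are illustrative only.
public class AggSizing {

    // Trigger the reduction estimate once the first-tier buffer has grown
    // to `memusage` (e.g. pig.cachedbag.memusage = 0.05) of the max heap,
    // i.e. Runtime.getRuntime().maxMemory(), instead of after a fixed
    // 10000-record sample.
    public static boolean shouldEstimateReduction(long bufferedBytes,
                                                  long maxHeapBytes,
                                                  double memusage) {
        return bufferedBytes >= (long) (maxHeapBytes * memusage);
    }

    // Size the second-tier buffer from the observed reduction, but always
    // allow at least one record: with a very high reduction such as 1000:1,
    // a naive firstTierCapacity / reduction would round down to zero.
    public static int secondTierCapacity(int firstTierCapacity, int reduction) {
        return Math.max(1, firstTierCapacity / reduction);
    }
}
```

The Math.max(1, ...) guard is the "at least one record in second-tier storage" change; without it, a high measured reduction starves the second tier entirely.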



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
