[ https://issues.apache.org/jira/browse/PIG-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029973#comment-14029973 ]

David Dreyfus commented on PIG-3979:
------------------------------------

Hi Philip,

Like you, I entered the murky world of spillable aggregates because I had jobs 
that took forever or died (seppuku). I noticed the problem was triggered by the 
fixed 20K-record read. I am no expert on this code, so I really can't offer 
many suggestions on how to fix the excessive spilling or the battle between GC 
and spilling. I'm glad the patch helped you.

I have no objection to your change to use the smaller of 20K records or a 
certain amount of memory, though I'm not sure what it would solve. Once the 
memory constraint is reached, the records get reduced. If that means 200K 
records are read first, so what?

POPartialAgg.getMemorySize() will return 0 whenever all the records to be read 
fit into the memory allotted. That is the case with or without the patch. The 
only thing the patch does is make the number of records read proportional to 
the size of the records and the amount of memory allocated to the task.
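
To make that concrete, here is a tiny, self-contained sketch of the sampling 
idea as I understand it. It is not the patch itself, and none of these names 
come from POPartialAgg; it only shows the sample size scaling with tuple size 
and the memory given to the task instead of stopping at a fixed record count.

import java.util.Collections;
import java.util.List;

/**
 * Illustrative sketch only, NOT the actual POPartialAgg code. The class,
 * method, and variable names are made up. The point: how many records get
 * sampled before estimating the reduction is driven by
 * pig.cachedbag.memusage * Runtime.getRuntime().maxMemory(), not by a
 * fixed 10K/20K count.
 */
public class SamplingSketch {

    /** How many records to sample before estimating the reduction. */
    static int recordsToSample(List<Long> tupleSizesInBytes, float memUsageFraction) {
        long budget = (long) (Runtime.getRuntime().maxMemory() * memUsageFraction);
        long used = 0;
        int sampled = 0;
        for (long size : tupleSizesInBytes) {
            if (used >= budget) {
                break;               // memory budget reached: stop sampling here
            }
            used += size;
            sampled++;
        }
        return Math.max(1, sampled); // never sample zero records
    }

    public static void main(String[] args) {
        // With memusage = 0.05 and ~1 KB tuples, the sample size tracks the
        // heap given to the task rather than stopping at a fixed 20,000 records.
        List<Long> sizes = Collections.nCopies(1_000_000, 1024L);
        System.out.println(recordsToSample(sizes, 0.05f));
    }
}

Reading until the budget is hit is the whole change; everything after the 
sample is handled the same way as before.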

I also played with the GC issues but didn't find a good solution that I 
understood well enough. Spilling seems to be triggered by the GC running. If 
spilling is triggered by GC, and spilling then tries to allocate memory, which 
triggers more GC, we have a death spiral. If we could disable GC while 
spilling, we might avoid death by GC. There really ought to be a better method 
than keeping memory utilization to a minimum. Solving the GC problem would be a 
coup.
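
For anyone who wants to dig further into the GC side, the trigger I'm 
describing is the JVM's collection-usage-threshold notification. The sketch 
below is not Pig's actual SpillableMemoryManager, just a minimal, standalone 
illustration of how a post-GC notification can kick off spilling, which is 
exactly where the spiral can start.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import javax.management.Notification;
import javax.management.NotificationEmitter;

/**
 * Minimal illustration (not Pig's SpillableMemoryManager) of spilling driven
 * by GC: a collection-usage threshold is set on heap pools, the JVM sends a
 * notification after a GC that still leaves a pool above the threshold, and
 * the listener responds by spilling. If the spill itself allocates memory, it
 * can provoke more GC and more notifications: the death spiral described above.
 */
public class SpillTriggerSketch {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            long max = pool.getUsage().getMax();
            if (pool.getType() == MemoryType.HEAP
                    && pool.isCollectionUsageThresholdSupported()
                    && max > 0) {
                // Ask to be notified when a GC still leaves this pool >70% full.
                pool.setCollectionUsageThreshold((long) (max * 0.7));
            }
        }
        NotificationEmitter emitter =
                (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener((Notification n, Object handback) -> {
            if (MemoryNotificationInfo.MEMORY_COLLECTION_THRESHOLD_EXCEEDED
                    .equals(n.getType())) {
                // This is the point where spillable containers would be asked
                // to spill; here we only log, since this is an illustration.
                System.out.println("GC left the heap above threshold -> spill");
            }
        }, null, null);
        // ... a real job would keep allocating here; data that survives GC past
        // the threshold causes the callback to fire after each collection.
    }
}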



> group all performance, garbage collection, and incremental aggregation
> ----------------------------------------------------------------------
>
>                 Key: PIG-3979
>                 URL: https://issues.apache.org/jira/browse/PIG-3979
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.12.0, 0.11.1
>            Reporter: David Dreyfus
>            Assignee: David Dreyfus
>             Fix For: 0.14.0
>
>         Attachments: PIG-3979-v1.patch, POPartialAgg.java
>
>
> I have a Pig statement similar to:
> summary = foreach (group data ALL) generate
>     COUNT(data.col1), SUM(data.col2), SUM(data.col3)
>     , Moments(data.col4)
>     , Moments(data.col5);
> There are a couple of hundred columns.
> I set the following:
> SET pig.exec.mapPartAgg true;
> SET pig.exec.mapPartAgg.minReduction 3;
> SET pig.cachedbag.memusage 0.05;
> I found that when I ran this on a JVM with insufficient memory, the process 
> eventually timed out because of an infinite garbage collection loop.
> The problem was invariant to the memusage setting.
> I solved the problem by making changes to:
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperator.POPartialAgg.java
> Rather than reading in 10,000 records to establish an estimate of the 
> reduction, I make the estimate after reading in enough tuples to fill the 
> pig.cachedbag.memusage fraction of Runtime.getRuntime().maxMemory().
> I also made a change to guarantee that at least one record is allowed in 
> second-tier storage. In the current implementation, if the reduction is very 
> high (1000:1), the space in second-tier storage is zero.
> With these changes, I can summarize large data sets with small JVMs. I also 
> find that setting pig.cachedbag.memusage to a small number such as 0.05 
> results in much better garbage collection performance without reducing 
> throughput. I suppose tuning the GC would also help with the excessive 
> garbage collection.
> The performance is sweet. 
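
For reference, the "at least one record in second-tier storage" change 
described above is, in spirit, just a lower bound on the computed capacity. 
The method below is illustrative only, with made-up names, not the actual 
patch:

/**
 * Illustrative only: with a very high observed reduction (e.g. 1000:1),
 * integer division can compute zero slots for second-tier storage; clamping
 * guarantees room for at least one record.
 */
static int secondTierCapacity(int firstTierCapacity, int observedReduction) {
    return Math.max(1, firstTierCapacity / observedReduction);
}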


