[ 
https://issues.apache.org/jira/browse/PIG-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4847:
------------------------------------
    Attachment: PIG-4847-1.patch

Changes done:
SpillableMemoryManager:
    - Set the collection and usage thresholds to 70% of old gen or (old gen 
size - 350 MB). This avoids unnecessary spills with bigger heaps. Previously 
the collection threshold was set at 50%, which caused unnecessary spills, 
especially in Tez, where with multiple inputs and outputs the sort buffers 
(io.sort.mb) can take up a lot of space. For example, one user had a 2G heap 
configured with io.sort.mb set to 896. Spill was triggered many times at 
around 1G of usage, because the 896 MB sort buffer cannot be GCed and so the 
collection threshold was hit far too often.
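
A rough sketch of the threshold rule described above, not the code from the 
patch. It assumes the larger of the two candidates is used, which is one 
reading of "70% of old gen or (old gen size - 350 MB)". The class, method, 
and constant names are hypothetical.

```java
// Illustrative only: names are hypothetical and not taken from PIG-4847-1.patch.
class SpillThresholdSketch {
    static final long MB = 1024L * 1024L;
    static final double THRESHOLD_FRACTION = 0.70;   // 70% of old gen
    static final long RESERVED_BYTES = 350L * MB;    // leave 350 MB unused

    // Assumption: take whichever candidate is larger, so that big heaps
    // are not forced to spill while most of old gen is still free.
    static long spillThreshold(long oldGenMaxBytes) {
        long byFraction = (long) (oldGenMaxBytes * THRESHOLD_FRACTION);
        long byReserve = oldGenMaxBytes - RESERVED_BYTES;
        return Math.max(byFraction, byReserve);
    }

    public static void main(String[] args) {
        // Small old gen: the 70% rule gives the larger value.
        System.out.println(spillThreshold(1024L * MB) / MB + " MB");
        // Large old gen: the (size - 350 MB) rule gives the larger value.
        System.out.println(spillThreshold(4096L * MB) / MB + " MB");
    }
}
```

Under this reading the two rules cross over where 30% of old gen equals 
350 MB, i.e. at roughly 1167 MB of old gen; above that size the fixed 350 MB 
reserve applies.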

POPartialAgg:
   - For the same case above, with thresholds of Primary: 170629 and 
Secondary: 28438, the early triggering of spills meant there would be only 
< 1000 entries in primary before POPartialAgg.spill() was invoked. Secondary 
stayed around 20K, so aggregation was very inefficient.
   - Avoided running through valuePlans if there was only a single tuple.
   - Update processedMap in place instead of creating another hashmap of the 
same size.
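
The last two points can be illustrated with a small standalone sketch. This 
is not the POPartialAgg code: the map type, the combine() helper (a stand-in 
for evaluating the value plans), and all names here are hypothetical. It 
only shows the general idea of skipping single-value entries and overwriting 
map values during iteration rather than building a second map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: not the actual POPartialAgg implementation.
class InPlaceAggSketch {
    // Hypothetical stand-in for running the value plans: sums the buffered values.
    static List<Long> combine(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        List<Long> out = new ArrayList<>(1);
        out.add(sum);
        return out;
    }

    // Aggregates each key's values in place and returns how many keys were combined.
    static int aggregateInPlace(Map<String, List<Long>> map) {
        int combined = 0;
        for (Map.Entry<String, List<Long>> e : map.entrySet()) {
            List<Long> values = e.getValue();
            if (values.size() <= 1) {
                continue; // single value: nothing to combine, skip the work
            }
            // setValue replaces the value without structurally modifying the map,
            // so it is safe during iteration and no second map is allocated.
            e.setValue(combine(values));
            combined++;
        }
        return combined;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> map = new HashMap<>();
        map.put("a", new ArrayList<>(List.of(1L, 2L, 3L)));
        map.put("b", new ArrayList<>(List.of(5L)));
        System.out.println(aggregateInPlace(map) + " key(s) combined: " + map);
    }
}
```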

> POPartialAgg processing and spill improvements
> ----------------------------------------------
>
>                 Key: PIG-4847
>                 URL: https://issues.apache.org/jira/browse/PIG-4847
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.16.0
>
>         Attachments: PIG-4847-1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
