[
https://issues.apache.org/jira/browse/PIG-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rohini Palaniswamy updated PIG-4847:
------------------------------------
Attachment: PIG-4847-1.patch
Changes done:
SpillableMemoryManager:
- Set the collection and usage thresholds to 70% of old gen or
(old gen size - 350 MB). This avoids unnecessary spills with bigger heaps.
Previously the collection threshold was set at 50%, which caused unnecessary
spills. Especially in Tez with multiple inputs and outputs, the sort buffers
(io.sort.mb) can take up a lot of space. For example, one user had a 2G heap
configured with io.sort.mb set to 896. Spill was triggered many times at
around 1G because the 896 MB sort buffer cannot be GCed, so the collection
threshold was hit far too often.
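To make the threshold rule above concrete, here is a minimal sketch, not the
code from the patch. The class and method names are hypothetical, and it
assumes the larger of the two candidates is used, which is what "avoid
unnecessary spills with bigger heaps" suggests.

```java
// Hypothetical sketch of the threshold rule; not the patch's actual code.
class SpillThresholdSketch {
    static final double THRESHOLD_FRACTION = 0.7;           // 70% of old gen (from the message)
    static final long RESERVED_BYTES = 350L * 1024 * 1024;  // 350 MB (from the message)

    // Assumption: the larger of the two candidates is the effective threshold.
    static long threshold(long oldGenBytes) {
        long byFraction = (long) (oldGenBytes * THRESHOLD_FRACTION);
        long byReserve = oldGenBytes - RESERVED_BYTES;
        return Math.max(byFraction, byReserve);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // Sample old gen sizes chosen for illustration only.
        for (long oldGenMb : new long[] {512, 1024, 1400, 4096}) {
            System.out.println(oldGenMb + " MB old gen -> threshold "
                    + threshold(oldGenMb * mb) / mb + " MB");
        }
    }
}
```

Under that assumption the 70% rule applies to old gen sizes below roughly
1167 MB (350 / 0.3), and above that the threshold simply leaves 350 MB free.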
POPartialAgg:
- For the same case above, with thresholds Primary: 170629 and Secondary:
28438, the early triggering of spills meant there were only < 1000 entries in
primary before POPartialAgg.spill() was invoked. The secondary value stayed
around 20K, so aggregation was very inefficient.
- Avoided running through valuePlans if there was only a single tuple.
- Update processedMap in place instead of creating another HashMap of the
same size.
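The last two points can be illustrated with a small sketch. It is not the
POPartialAgg code: the class, the String keys, and the summing aggregate()
are hypothetical stand-ins for Pig's tuples and value plans. It shows only
the idea of skipping single-entry keys and overwriting map values through
the entry instead of filling a second map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real operator works on Pig Tuples and value plans.
class InPlaceAggSketch {
    // Stand-in for running the value plans: here it just sums the values.
    static List<Long> aggregate(List<Long> values) {
        if (values.size() == 1) {
            return values; // single entry: nothing to combine, skip the work
        }
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        List<Long> out = new ArrayList<>(1);
        out.add(sum);
        return out;
    }

    // Replaces each value through Map.Entry.setValue, so no second map of
    // the same size is allocated. setValue is not a structural modification,
    // so it is safe while iterating over entrySet().
    static void aggregateInPlace(Map<String, List<Long>> processedMap) {
        for (Map.Entry<String, List<Long>> e : processedMap.entrySet()) {
            e.setValue(aggregate(e.getValue()));
        }
    }
}
```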
> POPartialAgg processing and spill improvements
> ----------------------------------------------
>
> Key: PIG-4847
> URL: https://issues.apache.org/jira/browse/PIG-4847
> Project: Pig
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4847-1.patch
>
>