[ 
https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900939#action_12900939
 ] 

Thejas M Nair commented on PIG-1447:
------------------------------------

Some more reasons why a higher value would still be safe:
1. Much of the memory attributed to the InternalDistinct/InternalSorted bags
used within a nested foreach is shared with the InternalCachedBag in the
input tuple, because Pig does not create a copy of the column objects.
2. In a nested foreach, only one inner plan at a time holds references to
the Internal* bags. The Internal* bags are eventually converted to
DefaultDataBag by RelationToExpressionProject in these plans. In the most
common cases (say, multiple count-distincts or order-bys on bags in a
nested foreach), that means only one Internal* bag created within the nested
foreach is referenced at a time. I compared the memory footprint with
different numbers of distinct operations in a nested foreach and found them
to be in the same range.
I am planning to set the default to 20% for now. If we find that memory
limits are being hit as a result during the beta testing period, we can
reduce the default.
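
For illustration, a nested foreach of the kind described in point 2 (several
distincts plus an order-by on the grouped bag) might look like the sketch
below. The input path, the schema, and all alias names are hypothetical and
are not taken from the attached scripts.

```pig
-- Hypothetical input; path and schema are assumptions for illustration.
visits  = LOAD 'visits' AS (user:chararray, url:chararray, ts:long);
by_user = GROUP visits BY user;

summary = FOREACH by_user {
    urls       = visits.url;
    uniq_urls  = DISTINCT urls;       -- first distinct on the grouped bag
    times      = visits.ts;
    uniq_times = DISTINCT times;      -- second distinct in the same block
    ordered    = ORDER visits BY ts;  -- order-by on the grouped bag
    GENERATE group, COUNT(uniq_urls), COUNT(uniq_times), ordered;
};
```

Per the reasoning above, the intermediate bags built by the two DISTINCT
statements and the ORDER statement would not all be referenced at once, so
adding more such operations should not multiply the memory footprint.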


> Tune memory usage of InternalCachedBag
> --------------------------------------
>
>                 Key: PIG-1447
>                 URL: https://issues.apache.org/jira/browse/PIG-1447
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: L15_modified.pig, L15_modified2.pig, PIG-1447.1.patch
>
>
> We need to find a better value for "pig.cachedbag.memusage".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
