[ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900939#action_12900939 ]
Thejas M Nair commented on PIG-1447:
------------------------------------

Some more reasons why a higher value would still be safe:

1. A lot of the memory attributed to the InternalDistinct/InternalSorted bags used from within a nested foreach will be shared with the InternalCachedBag in the input tuple, because Pig does not create a copy of the column objects.

2. In a nested foreach, only one inner plan holds references to the Internal* bags at a time. The Internal* bags are eventually converted to DefaultDataBag by RelationToExpressionProject in these plans. In the most common cases (say you are generating multiple count-distincts or order-bys on bags in a nested foreach), that means only one Internal* bag created within the nested foreach will be referenced at a time.

I tried comparing the memory footprint with different numbers of distinct operations in a nested foreach, and found them to be in the same range.

I am planning to set the default at 20% for now. If we find the memory limits being hit as a result of this during the beta testing period, we can reduce the default.

> Tune memory usage of InternalCachedBag
> --------------------------------------
>
>                 Key: PIG-1447
>                 URL: https://issues.apache.org/jira/browse/PIG-1447
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: L15_modified.pig, L15_modified2.pig, PIG-1447.1.patch
>
>
> We need to find a better value for "pig.cachedbag.memusage".

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
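
To make the arithmetic behind the 20% default in the comment above concrete, here is a rough sketch. It is not Pig's implementation: the helper name `per_bag_limit_bytes`, the even split among active bags, and the 1 GiB heap are all assumptions chosen for illustration. It only shows why the number of Internal* bags referenced at a time matters for how much memory each one could use before spilling.

```python
def per_bag_limit_bytes(max_heap_bytes, memusage_percent, active_bags):
    """Hypothetical helper (not a Pig API): split a percentage of the
    heap evenly among the bags that are referenced at the same time."""
    if not 0 < memusage_percent <= 100:
        raise ValueError("memusage_percent must be in (0, 100]")
    if active_bags < 1:
        raise ValueError("active_bags must be at least 1")
    # Integer arithmetic avoids floating-point rounding surprises.
    total_budget = max_heap_bytes * memusage_percent // 100
    return total_budget // active_bags


# Assumed 1 GiB heap, with the 20% default discussed in the comment.
heap = 1024 * 1024 * 1024
one_bag = per_bag_limit_bytes(heap, 20, 1)      # a single bag gets the whole budget
three_bags = per_bag_limit_bytes(heap, 20, 3)   # three live bags share the same budget
print(one_bag, three_bags)
```

Under these assumptions, if only one Internal* bag is referenced at a time, as the comment argues is the common case, each bag effectively sees the full budget rather than a fraction of it.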