[jira] [Commented] (PIG-2228) support partial aggregation in map task

[email protected] (JIRA) Tue, 13 Sep 2011 16:26:38 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104076#comment-13104076
 ]


[email protected] commented on PIG-2228:
----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1817/#review1877
-----------------------------------------------------------



trunk/src/org/apache/pig/Main.java
<https://reviews.apache.org/r/1817/#comment4318>

    will do. I will set the properties to the default value (if any), and ask 
the user to use "-h properties" command for description of the properties. (I 
don't want to increase the sources of truth for the details of the properties).



trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java
<https://reviews.apache.org/r/1817/#comment4319>

    Will do. For future -  it might be a good idea to generate these settings 
for pig code conventions as well with the eclipse-files target.



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4320>

    fixing!



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4336>

    will do some refactoring.
    



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4322>

    Yes, the mapDumpIteral will be non null if there were accumulated entries 
in the map.
    
    Will move the disableMapAgg out to improve readability. 
    



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4324>

    The return type is a Result object, STATUS_EOP is a byte. I don't want to 
create a new object. It might be useful to create 'static final' Result objects 
for these cases to improve readability. (Not doing that as part of this patch.)
    
    



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4325>

    described in comment where valueTuple is defined. 
        // tuple of the format - (null(key),bag-val1,bag-val2,...)
        // attach this to the plans with algebraic udf before evaluating the 
plans
        private transient Tuple valueTuple = null;
    
    The combiner plan is used by POPartialAgg without resetting the input 
offsets, that is why the first field has the key.



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4326>

    changing. I thought a check was being done at pig query compile time, but 
apparently thats not the case. Created- PIG-2283



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4327>

    existingResult is the tuple that holds key and value(s), if the key is 
present in aggMap, it will never be null. I didn't want to make two hashmap 
function calls (.contains() + .get()). 
    I am adding a comment to clarify.



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4328>

    this is the output of partial aggregation. it is the result of getOutput() 
function, which returns only ERR_RESULT or Result with STATUS_OK. Getting a 
status such as STATUS_NULL ,STATUS_EOP or STATUS_BATCH_OK does not make sense, 
and should result in an error (the query fails).
    



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4329>

    changing. i have re-formatted this whole file using eclipse, and its fixed 
all this.



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4330>

    Thanks for the suggesting it, that seems to be a good candidate. But I 
think that it is something that is better explored in a separate jira.
    It certainly needs a benchmark run. It might be important to select more 
'realistic' data to see how useful LRU/LFU replacement is going to be. I am not 
sure if pigmix is representative of data where LRU/LFU will be useful.
    
    The memory footprint of LinkedHashMap would be higher. The memory usage 
estimation logic also would need to change as it is based on sizes for HashMap. 
    



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4331>

    yeah, we do have large screens with awesome resolution ;) 
    But I will fix the excess newlines for the less fortunate :)
    



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4332>

    will create a public static final variable in POPartAgg for the property. 
But I think it would be good to define it in some central place with other 
properties. But we don't have such a place now.
    Will also use getInt.
    



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4333>

    changing.
    



trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
<https://reviews.apache.org/r/1817/#comment4334>

    the cast here is actually unnecessary, removing it.


- Thejas


On 2011-09-12 23:55:12, Thejas Nair wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/1817/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-09-12 23:55:12)
bq.  
bq.  
bq.  Review request for pig, Daniel Dai and Dmitriy Ryaboy.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  See PIG-2228
bq.  
bq.  
bq.  This addresses bug PIG-2228.
bq.      https://issues.apache.org/jira/browse/PIG-2228
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/src/org/apache/pig/Algebraic.java 1164722 
bq.    trunk/src/org/apache/pig/Main.java 1164722 
bq.    
trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java
 1164722 
bq.    
trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java
 1164722 
bq.    
trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java
 1164722 
bq.    
trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java
 1164722 
bq.    
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java
 1164722 
bq.    
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java
 1164722 
bq.    
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java
 1164722 
bq.    
trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java
 PRE-CREATION 
bq.    trunk/src/org/apache/pig/data/DefaultTuple.java 1164722 
bq.    trunk/src/org/apache/pig/data/InternalCachedBag.java 1164722 
bq.    trunk/src/org/apache/pig/data/InternalDistinctBag.java 1164722 
bq.    trunk/src/org/apache/pig/data/InternalSortedBag.java 1164722 
bq.    trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION 
bq.    trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION 
bq.    trunk/src/org/apache/pig/data/SortedSpillBag.java 1164722 
bq.    trunk/test/e2e/pig/tests/nightly.conf 1164722 
bq.    trunk/test/org/apache/pig/test/TestDataBag.java 1164722 
bq.    trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION 
bq.    trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION 
bq.    trunk/test/org/apache/pig/test/Util.java 1164722 
bq.    trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1164722 
bq.  
bq.  Diff: https://reviews.apache.org/r/1817/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  test-patch 
bq.       [exec] -1 overall.
bq.       [exec]
bq.       [exec]     +1 @author.  The patch does not contain any @author tags.
bq.       [exec]
bq.       [exec]     +1 tests included.  The patch appears to include 21 new or 
modified tests.
bq.       [exec]
bq.       [exec]     +1 javadoc.  The javadoc tool did not generate any warning 
messages.
bq.       [exec]
bq.       [exec]     +1 javac.  The applied patch does not increase the total 
number of javac compiler warnings.
bq.       [exec]
bq.       [exec]     +1 findbugs.  The patch does not introduce any new 
Findbugs warnings.
bq.       [exec]
bq.       [exec]     -1 release audit.  The applied patch generated 461 release 
audit warnings (more than the trunk's current 455 warnings).
bq.  release audit failures are because of jdiff changes
bq.  
bq.  All  unit tests pass, new e2e tests added .
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Thejas
bq.  
bq.



> support partial aggregation in map task
> ---------------------------------------
>
>                 Key: PIG-2228
>                 URL: https://issues.apache.org/jira/browse/PIG-2228
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.10
>
>         Attachments: PIG-2228.1.patch, PIG-2228.2.patch, PIG-2228.3.patch, 
> PIG-2228.4.patch, PIG-2228.5.patch
>
>
> h3. Introduction
> Pig does (sort based) partial aggregation in map side through the use of 
> combiner. MR serializes the output of map to a buffer, sorts it on the keys, 
> deserializes and passes the values grouped on the keys to combiner phase. The 
> same work of combiner can be done in the map phase itself by using a hash-map 
> on the keys. This hash based (partial) aggregation can be done with or 
> without a combiner phase.
> h3. Benefits
> It will send fewer records to combiner and thereby -
>   * Save on cost of serializing and de-serializing
>   * Save on cost of lock calls on the combiner input buffer. (I have found 
> this to be a significant cost for a query that was doing multiple group-by's 
> in a single MR job. -Thejas) 
>   * The problem of running out of memory in reduce side, for queries like 
> COUNT(distinct col) can be avoided. The OOM issue happens because very large 
> records get created after the combiner run on merged reduce input. In case of 
> combiner, you have no way of telling MR not to combine records in reduce 
> side. The workaround is to disable combiner completely, and the opportunity 
> to reduce map output size is lost.
>   * When the foreach after group-by has both algebraic and non-algebraic 
> functions, or if a bag is being projected, the combiner is not used. This is 
> because the data size reduction in typical cases are not significant enough 
> to justify the additional (de)serialization costs. But hash based aggregation 
> can be used in such cases as well.
>   * It is possible to turn off the in-map combine automatically if there is 
> not enough 'combination' that is taking place to justify the overhead of the 
> in-map combiner. (Idea borrowed from Hive jira.) 
>   * If input data is sorted, it is possible to do efficient map side 
> (partial) aggregation with in-map combiner.
> Design proposal is here - 
> https://cwiki.apache.org/confluence/display/PIG/PigInMapCombinerProposal

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2228) support partial aggregation in map task

Reply via email to