[ 
https://issues.apache.org/jira/browse/PIG-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103195#comment-13103195
 ] 

jirapos...@reviews.apache.org commented on PIG-2228:
----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1817/
-----------------------------------------------------------

Review request for pig, Daniel Dai and Dmitriy Ryaboy.


Summary
-------

See PIG-2228


This addresses bug PIG-2228.
    https://issues.apache.org/jira/browse/PIG-2228


Diffs
-----

  trunk/src/org/apache/pig/Algebraic.java 1164722 
  trunk/src/org/apache/pig/Main.java 1164722
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/CombinerOptimizer.java 1164722
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java 1164722
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PhyPlanSetter.java 1164722
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java 1164722
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1164722
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PlanPrinter.java 1164722
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POLocalRearrange.java 1164722
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartialAgg.java PRE-CREATION
  trunk/src/org/apache/pig/data/DefaultTuple.java 1164722 
  trunk/src/org/apache/pig/data/InternalCachedBag.java 1164722 
  trunk/src/org/apache/pig/data/InternalDistinctBag.java 1164722 
  trunk/src/org/apache/pig/data/InternalSortedBag.java 1164722 
  trunk/src/org/apache/pig/data/SelfSpillBag.java PRE-CREATION 
  trunk/src/org/apache/pig/data/SizeUtil.java PRE-CREATION 
  trunk/src/org/apache/pig/data/SortedSpillBag.java 1164722 
  trunk/test/e2e/pig/tests/nightly.conf 1164722 
  trunk/test/org/apache/pig/test/TestDataBag.java 1164722 
  trunk/test/org/apache/pig/test/TestPOPartialAgg.java PRE-CREATION 
  trunk/test/org/apache/pig/test/TestPOPartialAggPlan.java PRE-CREATION 
  trunk/test/org/apache/pig/test/Util.java 1164722 
  trunk/test/org/apache/pig/test/utils/GenPhyOp.java 1164722 

Diff: https://reviews.apache.org/r/1817/diff


Testing
-------

test-patch 
     [exec] -1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 21 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     -1 release audit.  The applied patch generated 461 release audit warnings (more than the trunk's current 455 warnings).
The release audit failures are caused by jdiff changes.

All unit tests pass; new e2e tests have been added.


Thanks,

Thejas



> support partial aggregation in map task
> ---------------------------------------
>
>                 Key: PIG-2228
>                 URL: https://issues.apache.org/jira/browse/PIG-2228
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.10
>
>         Attachments: PIG-2228.1.patch, PIG-2228.2.patch, PIG-2228.3.patch, 
> PIG-2228.4.patch, PIG-2228.5.patch
>
>
> h3. Introduction
> Pig does (sort-based) partial aggregation on the map side through the use 
> of the combiner. MR serializes the map output to a buffer, sorts it on the 
> keys, deserializes it, and passes the values grouped by key to the combiner 
> phase. The same work can be done in the map phase itself by using a hash map 
> on the keys. This hash-based (partial) aggregation can be done with or 
> without a combiner phase.
> h3. Benefits
> It will send fewer records to the combiner and thereby -
>   * Save on cost of serializing and de-serializing
>   * Save on cost of lock calls on the combiner input buffer. (I have found 
> this to be a significant cost for a query that was doing multiple group-by's 
> in a single MR job. -Thejas) 
>   * The problem of running out of memory on the reduce side, for queries 
> like COUNT(distinct col), can be avoided. The OOM issue happens because very 
> large records get created after the combiner runs on merged reduce input. 
> With a combiner, there is no way to tell MR not to combine records on the 
> reduce side. The workaround is to disable the combiner completely, and the 
> opportunity to reduce map output size is lost.
>   * When the foreach after group-by has both algebraic and non-algebraic 
> functions, or if a bag is being projected, the combiner is not used. This is 
> because the data size reduction in typical cases is not significant enough 
> to justify the additional (de)serialization costs. But hash-based aggregation 
> can be used in such cases as well.
>   * It is possible to turn off the in-map combine automatically if there is 
> not enough 'combination' that is taking place to justify the overhead of the 
> in-map combiner. (Idea borrowed from Hive jira.) 
>   * If input data is sorted, it is possible to do efficient map side 
> (partial) aggregation with in-map combiner.
> Design proposal is here - 
> https://cwiki.apache.org/confluence/display/PIG/PigInMapCombinerProposal
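
The hash-based aggregation described in the issue can be sketched roughly as follows. This is only an illustration of the idea, not the code in POPartialAgg: the class and method names (InMapPartialAgg, partialSum, worthKeeping) and the reduction threshold are hypothetical, and the sketch assumes a simple SUM over (key, value) pairs on Java 9 or later.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the Pig patch.
class InMapPartialAgg {

    // Partially aggregates map output with a hash map keyed on the group key,
    // so that at most one record per key is handed on to the combiner/reducer.
    static Map<String, Long> partialSum(List<Map.Entry<String, Long>> mapOutput) {
        Map<String, Long> acc = new LinkedHashMap<>();
        for (Map.Entry<String, Long> rec : mapOutput) {
            // Add the value to the running sum for this key (or start a new sum).
            acc.merge(rec.getKey(), rec.getValue(), Long::sum);
        }
        return acc;
    }

    // Assumed heuristic for the "turn off automatically" idea: keep aggregating
    // in the map only if the record count shrinks by at least minReduction.
    static boolean worthKeeping(int inputRecords, int outputRecords, double minReduction) {
        if (outputRecords == 0) {
            return true; // nothing was emitted, so there is no evidence against it
        }
        return (double) inputRecords / outputRecords >= minReduction;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> records = List.of(
                Map.entry("a", 1L), Map.entry("b", 1L),
                Map.entry("a", 1L), Map.entry("a", 1L));
        Map<String, Long> partial = partialSum(records);
        System.out.println(partial);
        System.out.println(worthKeeping(records.size(), partial.size(), 2.0));
    }
}
```

A real implementation would also have to bound the memory used by the hash map (for example by flushing it when it grows too large), which this sketch leaves out.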

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
