[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13528609#comment-13528609
 ] 

Jerry Chen commented on MAPREDUCE-3247:
---------------------------------------

Binglin, I noticed that you created this bug from MAPREDUCE-1639, and I think 
these two bugs are more or less similar. There is also a lot of related work 
going on, such as MAPREDUCE-2454 and MAPREDUCE-4049.

If you are not working on this, I would like to take time to work on this 
feature.
                
> Add hash aggregation style data flow and/or new API
> ---------------------------------------------------
>
>                 Key: MAPREDUCE-3247
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3247
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: task
>    Affects Versions: 0.23.0
>            Reporter: Binglin Chang
>              Labels: api, perfomance
>
> In many join/aggregation-like queries run on top of MapReduce, sorting is not 
> needed; in fact, a hash-table-based join/aggregation is more efficient. This 
> is described in detail in "Tenzing A SQL Implementation On The MapReduce 
> Framework". There are two ways to support hash-table-based join/aggregation 
> in Hadoop MapReduce:
> # Only support a no-sort mode: the framework does nothing except pass 
> partitioned k/v pairs from mapper to reducer.
>    The upper-level application uses a hash table in its mapper and reducer to 
> do the aggregation, and emits all hash table entries in cleanup() of the 
> mapper/reducer. This is how Google did it in Tenzing. The main problem is 
> controlling the memory used by the hash table.
> # Add a new "fold" API that can coexist with the combiner/reducer API; users 
> can use either mapper-combiner-reducer or "mapper-folder" (maybe a bad name, 
> better suggestions are welcome).
>    Like foldl in functional programming, the folder should have these semantics:
>      foldl folder z (x:xs)  =   foldl folder (folder z x) xs
>    This way, upper-level applications only need to provide the folder; the 
> underlying framework creates and maintains the hash table for key/value 
> pairs, so it can be managed and optimized by the framework. For example, on 
> the mapper side, we can pre-emit the entire hash table, or use a policy such 
> as a cache eviction algorithm to emit part of the k/v pairs and free some 
> memory, if memory consumption reaches io.sort.mb.
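
A minimal sketch of the second option (a framework-managed hash table driven by a user-supplied folder) might look like the following. It is written in Python for brevity rather than Hadoop's Java, and every name in it (FoldingBuffer, collect, flush, max_entries, word_count) is hypothetical and not part of any Hadoop API. A simple entry-count limit stands in for a byte budget such as io.sort.mb, and the whole table is emitted when the limit is hit.

```python
from typing import Callable, Dict, Hashable, List, Tuple


class FoldingBuffer:
    """Hypothetical framework-side hash table that applies a user folder.

    For each key it keeps one accumulator and updates it with
    folder(accumulator, value), mirroring foldl applied per key.
    """

    def __init__(
        self,
        folder: Callable[[int, int], int],
        zero: Callable[[], int],
        emit: Callable[[Hashable, int], None],
        max_entries: int,
    ) -> None:
        self.folder = folder            # user code: (acc, value) -> acc
        self.zero = zero                # factory for the initial accumulator z
        self.emit = emit                # downstream collector (assumed)
        self.max_entries = max_entries  # stand-in for a memory budget
        self.table: Dict[Hashable, int] = {}

    def collect(self, key: Hashable, value: int) -> None:
        # A new key would exceed the budget: pre-emit the whole table first.
        if key not in self.table and len(self.table) >= self.max_entries:
            self.flush()
        acc = self.table[key] if key in self.table else self.zero()
        self.table[key] = self.folder(acc, value)

    def flush(self) -> None:
        # Emit every (key, accumulator) pair, then free the memory.
        for key, acc in self.table.items():
            self.emit(key, acc)
        self.table.clear()


def word_count(words: List[str], max_entries: int) -> List[Tuple[str, int]]:
    """Usage example: count words with a bounded hash table."""
    emitted: List[Tuple[str, int]] = []
    buf = FoldingBuffer(
        folder=lambda acc, x: acc + x,
        zero=lambda: 0,
        emit=lambda k, v: emitted.append((k, v)),
        max_entries=max_entries,
    )
    for word in words:
        buf.collect(word, 1)
    buf.flush()  # final emit, analogous to cleanup()
    return emitted
```

Because an early flush emits partial results, the same key can appear more than once in the output, so the receiving side has to fold those partial values again. That works for a sum as in this sketch, but it is an assumption about the folder, not something the proposal above guarantees.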

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
