Add hash aggregation style data flow and/or new API
---------------------------------------------------
Key: MAPREDUCE-3247
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3247
Project: Hadoop Map/Reduce
Issue Type: New Feature
Components: task
Affects Versions: 0.23.0
Reporter: Binglin Chang
In many join- and aggregation-style queries that run on top of MapReduce,
sorting is not needed; a hash-table-based join/aggregation is more efficient.
This is described in detail in "Tenzing A SQL Implementation On The MapReduce
Framework". There are two ways to support hash-table-based join/aggregation in
Hadoop MapReduce:
# Support a no-sort mode only: the framework does nothing except pass
partitioned k/v pairs from mapper to reducer.
The application uses a hash table in its mapper & reducer to do the
aggregation, and emits all hash table entries in cleanup() of the
mapper/reducer. This is how Google did it in Tenzing. The main problem is
controlling the memory used by the hash table.
# Add a new "fold" API. It can coexist with the combiner/reducer API, so users
can use either mapper-combiner-reducer or "mapper-folder" (maybe a bad name;
proposals for a better one are welcome).
Like foldl in functional programming, the folder should have the semantics:
foldl folder z (x:xs) = foldl folder (folder z x) xs
In this way, the application only needs to provide the folder; the underlying
framework creates and maintains the hash table of key/value pairs, so the table
can be managed & optimized by the framework. For example, on the mapper side,
when memory consumption reaches io.sort.mb, we can pre-emit the entire hash
table, or use a policy such as a cache replacement algorithm to emit part of
the k/v pairs and free some memory.
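To make the first approach concrete, here is a minimal sketch in plain Java of
what the application-side code could look like. HashAggregatingMapper and its
methods are hypothetical names used only for illustration, not Hadoop classes;
in a real job the same logic would presumably sit in the map() and cleanup()
methods of a Mapper, with the output callback replaced by the task context.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical stand-in for a mapper that aggregates in memory instead of
// relying on the framework's sort. This is not a Hadoop class.
class HashAggregatingMapper {
    // Per-key running sums held by the application itself.
    private final Map<String, Long> sums = new HashMap<>();

    // Called once per input record: add the value to the sum for its key.
    void map(String key, long value) {
        sums.merge(key, value, Long::sum);
    }

    // Called once at end of input: emit every hash table entry, then clear.
    void cleanup(BiConsumer<String, Long> output) {
        sums.forEach(output);
        sums.clear();
    }

    // Number of distinct keys currently held in memory.
    int entriesInMemory() {
        return sums.size();
    }
}
```

Nothing in this sketch bounds the size of the table, which is exactly the
memory control problem mentioned above: the application would have to add its
own limit and early-emit logic.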
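For the second approach, a rough sketch of how a "fold" API and a
framework-managed table might fit together is given below. Everything here is
an assumption made for illustration: the Folder interface, the FoldingBuffer
class and the maxEntries limit (a simple stand-in for a byte budget such as
io.sort.mb) are hypothetical and not part of any existing Hadoop API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical user-facing API: combine the accumulated value with one more
// value, like the function argument of foldl.
interface Folder<V> {
    V fold(V accumulated, V next);
}

// Hypothetical framework-side buffer that owns the hash table.
// Assumes keys and values are non-null.
class FoldingBuffer<K, V> {
    private final Folder<V> folder;
    private final int maxEntries;               // stand-in for io.sort.mb
    private final BiConsumer<K, V> downstream;  // where emitted pairs go
    private final Map<K, V> table = new HashMap<>();

    FoldingBuffer(Folder<V> folder, int maxEntries, BiConsumer<K, V> downstream) {
        this.folder = folder;
        this.maxEntries = maxEntries;
        this.downstream = downstream;
    }

    // Fold one k/v pair into the table; pre-emit the whole table first if a
    // new key would push it past the limit.
    void collect(K key, V value) {
        V current = table.get(key);
        if (current == null && table.size() >= maxEntries) {
            flush();
        }
        table.put(key, current == null ? value : folder.fold(current, value));
    }

    // Emit every entry downstream and free the memory.
    void flush() {
        table.forEach(downstream);
        table.clear();
    }
}
```

Because of pre-emitting, a key can reach the downstream side more than once
with partial results, so the receiving side would have to fold those partial
values again. That only gives the same answer when the folder is associative,
as with a sum: `new FoldingBuffer<String, Long>(Long::sum, 1000, out)`.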