[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131213#comment-13131213
 ] 

He Yongqiang commented on MAPREDUCE-1639:
-----------------------------------------

something that may help get more discussions:

we are trying to experiment on a new output collector. Here are some of our 
thoughts:
1) group key, value in a memoryblock which is basically start-end pointer to 
the big kvbuffer. One memoryblock must belong to one reducer.
2) use quicksort to sort data in memoryblock
3) use binaryinsert sort when doing spill. in this phase, since memoryblocks 
are grouped by reducer already, so this will not sort memoryblocks across 
reducers

On group by and join, if not enforced sorting, only a grouping is needed, no 
need for a global sort.
On the mapper side:
have an hashtable for memoryblocks. And use the hash to decide which memory 
block to go. this will only help reduce number of sort across memoryblocks, but 
will not eliminate them. This is because of memory constrain(in which case we 
need to borrow memory from other memory block) and collision.

On the reduce side:
all mappers are applying the same rule, so can add some metadata for each 
mapper's output to help reducer side decide whether or not need to compare.

                
> Grouping using hashing instead of sorting
> -----------------------------------------
>
>                 Key: MAPREDUCE-1639
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1639
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Joydeep Sen Sarma
>
> most applications of map-reduce care about grouping and not sorting. Sorting 
> is a (relatively expensive) way to achieve grouping. In order to achieve just 
> grouping - one can:
> - replace the sort on the Mappers with a HashTable - and maintain lists of 
> key-values against each hash-bucket.
> - key-value tuples inside each hash bucket are sorted - before spilling or 
> sending to Reducer. Anytime this is done - Combiner can be invoked.
> - HashTable is serialized by hash-bucketid. So merges (of either spills or 
> Map Outputs) works similar to today (at least there's no change in overall 
> compute complexity of merge)
> Of course this hashtable has nothing to do with partitioning. it's just a 
> replacement for map-side sort.
> --
> this is (pretty much) straight from the MARS project paper: 
> http://www.cse.ust.hk/catalac/papers/mars_pact08.pdf. They report a 45% 
> speedup in inverted index calculation using hashing instead of sorting 
> (reference implementation is NOT against Hadoop though).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to