[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029596#comment-13029596
 ] 

Mariappan Asokan commented on MAPREDUCE-2454:
---------------------------------------------

Hi Owen,
  Thanks for your comments.  I like your suggestion on the signature of the 
initialize() method and also on not having a flush().  However, I prefer to 
pass the Key and Value as objects instead of a serialized ByteArray for the 
following reasons:
* It is easier and more efficient when an external program (like the UNIX sort 
command) is invoked as a sorter.  The Key and Value types will be Text.  The 
bytes in the Text can be grabbed and passed to the program with a TAB between 
them.  There is no need to deserialize data passed in the ByteArray.  This is 
similar to what happens with Hadoop streaming when, for example, a Mapper is 
implemented by an external program.  Also, on the Map side the output of the 
mapper is key and value objects, which can be passed directly to the sorter.  
Thus there is no need for extra serialization/deserialization.  A similar 
argument applies when the output of the sorter is read on the Reduce side using 
RecordReader.
* The framework's serialization is in no way affected.  The framework remains 
free to replace the serialization layer.  The external sorter can store the 
sorted output as simple UNIX text records in the final map output file since it 
will deal with the shuffled data on the Reduce side.
* For the RecordReader, I think it is better to change the signature of 
getKey() and getValue() as below:
{code:title=RecordReader}
Object getKey(Object key) // If key is null, it will be allocated first.
Object getValue(Object value) // If value is null, it will be allocated first.
{code}
The reasons for these signatures are:
   ** The RecordReader will be used for running the Combiner and Reducer.  This 
may involve saving the last seen key.  If the caller passes the key object, it 
can just save the object handle, not the entire object, since it owns the 
object.  If the callee returns its own object, that object is ephemeral, so the 
caller has to copy it in order to save it, which results in extra copying.
   ** Creating an adapter to return key and value objects from their serialized 
counterparts (that is, from RawKeyValueIterator) will not result in any extra 
data copying.  So the performance of the framework's sorter will not degrade.
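
A minimal sketch of the caller-owned object idea behind the getKey()/getValue() 
signatures above.  Everything here is an assumption made for illustration: the 
interface name SketchRecordReader, the next() method, the in-memory reader and 
the GroupCounter caller are hypothetical and are not part of Hadoop or of the 
attached files.  StringBuilder stands in for a mutable key type such as Text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface using the proposed signatures; not an existing Hadoop API.
interface SketchRecordReader {
    boolean next();                 // assumed: advances to the next record
    Object getKey(Object key);      // fills the caller's object; allocates if key is null
    Object getValue(Object value);  // fills the caller's object; allocates if value is null
}

// Toy in-memory reader; StringBuilder stands in for a mutable type such as Text.
class InMemoryReader implements SketchRecordReader {
    private final String[][] records;
    private int pos = -1;

    InMemoryReader(String[][] records) { this.records = records; }

    public boolean next() { return ++pos < records.length; }

    public Object getKey(Object key) { return fill(key, records[pos][0]); }

    public Object getValue(Object value) { return fill(value, records[pos][1]); }

    private static Object fill(Object target, String data) {
        StringBuilder sb = (target == null) ? new StringBuilder() : (StringBuilder) target;
        sb.setLength(0);   // reuse the caller's buffer
        sb.append(data);
        return sb;
    }
}

// Hypothetical caller that needs the last seen key, as a Combiner or Reducer driver would.
class GroupCounter {
    static List<String> distinctKeys(SketchRecordReader reader) {
        List<String> out = new ArrayList<>();
        Object current = null;  // caller-owned buffer for the key being read
        Object last = null;     // caller-owned buffer holding the last seen key
        while (reader.next()) {
            current = reader.getKey(current);
            if (last == null || !last.toString().equals(current.toString())) {
                out.add(current.toString());
            }
            // Swap handles instead of copying the key contents.
            Object tmp = last;
            last = current;
            current = tmp;
        }
        return out;
    }
}
```

For sorted input with keys a, a, b, distinctKeys() returns [a, b] and only two 
key objects are ever allocated, because the caller swaps the two handles it 
owns instead of copying key contents.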

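For the first point above, about handing the bytes of Text keys and values to 
an external program with a TAB between them, a rough sketch could look like the 
following.  The class TabRecordWriter is hypothetical and not part of Hadoop.  
With Text objects, the arguments would come from getBytes() and getLength(); 
the OutputStream could be the standard input of the external sort process.

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical helper; not part of Hadoop or of the attached files.
class TabRecordWriter {
    // Writes one record as key<TAB>value<NEWLINE> without deserializing anything.
    // Assumes keys and values contain no TAB or newline bytes; otherwise the
    // external program could not split the records back apart.
    static void write(OutputStream out,
                      byte[] key, int keyLen,
                      byte[] value, int valueLen) throws IOException {
        out.write(key, 0, keyLen);     // only the first keyLen bytes are valid
        out.write('\t');
        out.write(value, 0, valueLen); // only the first valueLen bytes are valid
        out.write('\n');
    }
}
```

The explicit lengths matter because the backing array of a Text can be longer 
than its valid contents.
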
Owen, can you suggest a committer with whom I can work on this?
Thanks.

> Allow external sorter plugin for MR
> -----------------------------------
>
>                 Key: MAPREDUCE-2454
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2454
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Mariappan Asokan
>            Priority: Minor
>         Attachments: KeyValueIterator.java, MapOutputSorter.java, 
> MapOutputSorterAbstract.java, ReduceInputSorter.java
>
>
> Define interfaces and some abstract classes in the Hadoop framework to 
> facilitate external sorter plugins both on the Map and Reduce sides.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
