[
https://issues.apache.org/jira/browse/MAPREDUCE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029596#comment-13029596
]
Mariappan Asokan commented on MAPREDUCE-2454:
---------------------------------------------
Hi Owen,
Thanks for your comments. I like your suggestion on the signature of the
initialize() method and also on not having a flush(). However, I prefer to pass
the Key and Value as objects instead of as a serialized ByteArray, for the
following reasons:
* It is easier and more efficient when an external program (like the UNIX sort
command) is invoked as a sorter. The Key and Value types will be Text. The bytes
in the Text can be grabbed and passed to the program with a TAB between them.
There is no need to deserialize data passed in the ByteArray. This is similar
to what happens in Hadoop Streaming when, for example, a Mapper is
implemented by an external program. Also, on the Map side the output of the
mapper is key and value objects, which can be passed directly to the sorter.
Thus there is no need for extra serialization/deserialization. A similar argument
applies when the output of the sorter is read on the Reduce side using RecordReader.
* The framework's serialization is in no way affected. The framework is free to
replace the serialization layer. The external sorter can store the sorted output
as simple UNIX text records in the final map output file, since the sorter itself
will also deal with the shuffled data on the Reduce side.
* For the RecordReader, I think it is better to change the signatures of
getKey() and getValue() as below:
{code:title=RecordReader}
Object getKey(Object key) // If key is null, it will be allocated first.
Object getValue(Object value) // If value is null, it will be allocated first.
{code}
The reasons for these signatures are:
** The RecordReader will be used for running the Combiner and the Reducer. This
may involve saving the last seen key. If the caller passes the key object, it can
just save the object handle, not the entire object, since it owns the object. If
the callee returns its own object, that object is ephemeral, so the caller has
to copy it in order to save it, which results in extra copying.
** Creating an adapter to return key and value objects from their serialized
counterparts (that is, from RawKeyValueIterator) will not result in any extra
data copying. So the performance of the framework's sorter will not degrade.
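To make the points above concrete, here is a minimal Java sketch. It is an illustration only, not the attached interfaces. Only the getKey(Object)/getValue(Object) shape comes from the proposal. MutableText is a hypothetical stand-in for Text so that the example runs without Hadoop, and next(), ListRecordReader, and ReaderDemo are assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a reusable, mutable type such as Text.
final class MutableText {
    private byte[] bytes = new byte[0];

    void set(String s) { bytes = s.getBytes(StandardCharsets.UTF_8); }

    byte[] getBytes() { return bytes; }
}

// Sketch of the proposed reader shape. next() is an assumption; the
// original comment only specifies getKey() and getValue().
interface RecordReader {
    boolean next();
    Object getKey(Object key);     // if key is null, it is allocated first
    Object getValue(Object value); // if value is null, it is allocated first
}

// Hypothetical in-memory reader over {key, value} string pairs.
final class ListRecordReader implements RecordReader {
    private final Iterator<String[]> records;
    private String[] current;

    ListRecordReader(List<String[]> input) { records = input.iterator(); }

    public boolean next() {
        if (!records.hasNext()) {
            return false;
        }
        current = records.next();
        return true;
    }

    public Object getKey(Object key) {
        MutableText k = (key == null) ? new MutableText() : (MutableText) key;
        k.set(current[0]); // fill the caller-owned object in place
        return k;
    }

    public Object getValue(Object value) {
        MutableText v = (value == null) ? new MutableText() : (MutableText) value;
        v.set(current[1]);
        return v;
    }
}

final class ReaderDemo {
    // Counts key groups in sorted input. The caller owns the key objects,
    // so remembering the last seen key is a handle swap, not a copy.
    static int countGroups(RecordReader reader) {
        MutableText lastKey = null;
        MutableText spare = null;
        int groups = 0;
        while (reader.next()) {
            MutableText key = (MutableText) reader.getKey(spare);
            if (lastKey == null || !Arrays.equals(lastKey.getBytes(), key.getBytes())) {
                groups++;
            }
            spare = lastKey; // recycle the older object on the next call
            lastKey = key;   // keep only the handle to the filled object
        }
        return groups;
    }

    // Formats one record the way it could be piped to an external sorter:
    // key bytes, a TAB, value bytes, and a newline.
    static byte[] toTabSeparatedLine(MutableText key, MutableText value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(key.getBytes(), 0, key.getBytes().length);
        out.write('\t');
        out.write(value.getBytes(), 0, value.getBytes().length);
        out.write('\n');
        return out.toByteArray();
    }
}
```

In this sketch countGroups() allocates at most two key objects regardless of the number of records, which is the saving the caller-owned signature is meant to allow.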
Owen, do you have any suggestion on a committer with whom I can work on this?
Thanks.
> Allow external sorter plugin for MR
> -----------------------------------
>
> Key: MAPREDUCE-2454
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2454
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Reporter: Mariappan Asokan
> Priority: Minor
> Attachments: KeyValueIterator.java, MapOutputSorter.java,
> MapOutputSorterAbstract.java, ReduceInputSorter.java
>
>
> Define interfaces and some abstract classes in the Hadoop framework to
> facilitate external sorter plugins both on the Map and Reduce sides.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira