[jira] Commented: (MAPREDUCE-1183) Serializable job components: Mapper, Reducer, InputFormat, OutputFormat et al

Arun C Murthy (JIRA) Wed, 04 Nov 2009 18:42:59 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773756#action_12773756
 ]


Arun C Murthy commented on MAPREDUCE-1183:
------------------------------------------

The current Configuration-based system has issues in a couple of use-cases:
# The primary drawback: Difficulty in implementing a 
Composite{Input|Output}Format
Pig is in the middle of a re-write of their Load/Store interfaces 
(http://wiki.apache.org/pig/LoadStoreRedesignProposal) where they want to be 
able to take an arbitrary InputFormat or OutputFormat and wrap it for use 
within Pig. Similarly a 'CompositeInputFormat' which can work with multiple 
InputFormats (say a map-side merge between data in multiple SequenceFiles and 
TFiles) leads to a situation where we push the {Input|Output}Format to deal 
with multiple copies of Configuration and manage them. This necessary because 
using a single Configuration results in same configuration key being 
over-written by multiple instances of {Input|Output}Format (say 
mapred.input.dir over-written by SequenceFileInputFormat and TFileInputFormat).
# Annoyance: An application which needs a very small amount of state in the 
Mapper/Reducer (say a small map of metadata) is forced to use DistributedCache, 
it's much more natural to have that state stored in the Mapper/Reducer and have 
it serialized from the client to the compute nodes.

Thus the proposal is to move to a model where an actual 
Mapper/Reducer/InputFormat/OutputFormat object is serialized by the framework, 
thus eliminating the need for using Configuration for storing the requisite 
information and using the object to keep the necessary state e.g. 
FileInputFormat will have a member to keep a list of input-paths to be 
processed.

The new api would look like:
{noformat}
Job job = new Job();
job.setMapper(new WordCountMapper());
job.setReducer(new WordCountReducer());
InputFormat in = new TextInputFormat("in");
in.addInputPath("in2");
OutputFormat out = new TextOutputFormat("out");
job.setInputFormat(in);
job.setOutputFormat(out);
job.waitForCompletion();
{noformat}

----

Thoughts?

> Serializable job components: Mapper, Reducer, InputFormat, OutputFormat et al
> -----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1183
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1183
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 0.21.0
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>
> Currently the Map-Reduce framework uses Configuration to pass information 
> about the various aspects of a job such as Mapper, Reducer, InputFormat, 
> OutputFormat, OutputCommitter etc. and application developers use 
> org.apache.hadoop.mapreduce.Job.set*Class apis to set them at job-submission 
> time:
> {noformat}
> Job.setMapperClass(IdentityMapper.class);
> Job.setReducerClass(IdentityReducer.class);
> Job.setInputFormatClass(TextInputFormat.class);
> Job.setOutputFormatClass(TextOutputFormat.class);
> ...
> {noformat}
> The proposal is that we move to a model where end-users interact with 
> org.apache.hadoop.mapreduce.Job via actual objects which are then serialized 
> by the framework:
> {noformat}
> Job.setMapper(new IdentityMapper());
> Job.setReducer(new IdentityReducer());
> Job.setInputFormat(new TextInputFormat("in"));
> Job.setOutputFormat(new TextOutputFormat("out"));
> ...
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAPREDUCE-1183) Serializable job components: Mapper, Reducer, InputFormat, OutputFormat et al

Reply via email to