[
https://issues.apache.org/jira/browse/MAPREDUCE-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773756#action_12773756
]
Arun C Murthy commented on MAPREDUCE-1183:
------------------------------------------
The current Configuration-based system has issues in a couple of use-cases:
# The primary drawback: Difficulty in implementing a
Composite{Input|Output}Format
Pig is in the middle of a re-write of their Load/Store interfaces
(http://wiki.apache.org/pig/LoadStoreRedesignProposal) where they want to be
able to take an arbitrary InputFormat or OutputFormat and wrap it for use
within Pig. Similarly a 'CompositeInputFormat' which can work with multiple
InputFormats (say a map-side merge between data in multiple SequenceFiles and
TFiles) leads to a situation where we push the {Input|Output}Format to deal
with multiple copies of Configuration and manage them. This necessary because
using a single Configuration results in same configuration key being
over-written by multiple instances of {Input|Output}Format (say
mapred.input.dir over-written by SequenceFileInputFormat and TFileInputFormat).
# Annoyance: An application which needs a very small amount of state in the
Mapper/Reducer (say a small map of metadata) is forced to use DistributedCache,
it's much more natural to have that state stored in the Mapper/Reducer and have
it serialized from the client to the compute nodes.
Thus the proposal is to move to a model where an actual
Mapper/Reducer/InputFormat/OutputFormat object is serialized by the framework,
thus eliminating the need for using Configuration for storing the requisite
information and using the object to keep the necessary state e.g.
FileInputFormat will have a member to keep a list of input-paths to be
processed.
The new api would look like:
{noformat}
Job job = new Job();
job.setMapper(new WordCountMapper());
job.setReducer(new WordCountReducer());
InputFormat in = new TextInputFormat("in");
in.addInputPath("in2");
OutputFormat out = new TextOutputFormat("out");
job.setInputFormat(in);
job.setOutputFormat(out);
job.waitForCompletion();
{noformat}
----
Thoughts?
> Serializable job components: Mapper, Reducer, InputFormat, OutputFormat et al
> -----------------------------------------------------------------------------
>
> Key: MAPREDUCE-1183
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1183
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: client
> Affects Versions: 0.21.0
> Reporter: Arun C Murthy
> Assignee: Arun C Murthy
>
> Currently the Map-Reduce framework uses Configuration to pass information
> about the various aspects of a job such as Mapper, Reducer, InputFormat,
> OutputFormat, OutputCommitter etc. and application developers use
> org.apache.hadoop.mapreduce.Job.set*Class apis to set them at job-submission
> time:
> {noformat}
> Job.setMapperClass(IdentityMapper.class);
> Job.setReducerClass(IdentityReducer.class);
> Job.setInputFormatClass(TextInputFormat.class);
> Job.setOutputFormatClass(TextOutputFormat.class);
> ...
> {noformat}
> The proposal is that we move to a model where end-users interact with
> org.apache.hadoop.mapreduce.Job via actual objects which are then serialized
> by the framework:
> {noformat}
> Job.setMapper(new IdentityMapper());
> Job.setReducer(new IdentityReducer());
> Job.setInputFormat(new TextInputFormat("in"));
> Job.setOutputFormat(new TextOutputFormat("out"));
> ...
> {noformat}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.