Re: [jira] Commented: (HADOOP-372) should allow to specify different inputformat classes for different input dirs for Map/Reduce jobs

Arkady Borkovsky Thu, 24 Aug 2006 09:42:21 -0700

+1 for Owen's arguments.

Although at the code level anything that is done to the input record can be put either into an InputFormat or into a Mapper, it seems to be quite important to force a clear separation of these concepts. A more naive user may prefer to be completely ignorant of the notion of InputFormat.

Defining interfaces that are easy to understand is more than just syntactic sugar, and usability should not be sacrificed to orthogonality.



On Aug 23, 2006, at 11:49 PM, Doug Cutting (JIRA) wrote:

[http://issues.apache.org/jira/browse/HADOOP-372?page=comments#action_12430107 ]
Doug Cutting commented on HADOOP-372:
-------------------------------------
A very typical case is to have the same input format, but different Mappers
But, if the mapper is a function of the input format, this can instead be:
job.addInputPath("foo", FooInput.class);
job.addInputPath("bar", BarInput.class);

Where FooInput is defined with something like:

public class FooInput extends TextInput {
  public void map(...) { ... };
}
In other words, if you're going to define custom mappers anyway, then it's no more work to define custom Input formats.
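A minimal sketch of this alternative, where the class registered for a path also supplies the map function. All types here (FormatAsMapper, the simplified TextInput base class, the map signature, runMap) are hypothetical stand-ins for illustration and are not the Hadoop InputFormat or Mapper interfaces; lines are passed in directly rather than read from files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: simplified stand-ins, not the Hadoop API.
final class FormatAsMapper {

    /** Stand-in base class: each input line is one record, mapped by the subclass. */
    abstract static class TextInput {
        abstract void map(String line, List<String> out);
    }

    /** Input class for the "foo" path; its map() tags each record. */
    static class FooInput extends TextInput {
        @Override
        void map(String line, List<String> out) {
            out.add("foo:" + line);
        }
    }

    /** Input class for the "bar" path, with a different map(). */
    static class BarInput extends TextInput {
        @Override
        void map(String line, List<String> out) {
            out.add("bar:" + line);
        }
    }

    // One input class per path; the class determines how records are mapped.
    private final Map<String, Class<? extends TextInput>> inputs = new LinkedHashMap<>();

    void addInputPath(String path, Class<? extends TextInput> inputClass) {
        inputs.put(path, inputClass);
    }

    /** Instantiates the class registered for the path and maps every line with it. */
    List<String> runMap(String path, List<String> lines) {
        Class<? extends TextInput> cls = inputs.get(path);
        if (cls == null) {
            throw new IllegalArgumentException("unknown input path: " + path);
        }
        TextInput input;
        try {
            input = cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + cls.getName(), e);
        }
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            input.map(line, out);
        }
        return out;
    }

    public static void main(String[] args) {
        FormatAsMapper job = new FormatAsMapper();
        job.addInputPath("foo", FooInput.class);
        job.addInputPath("bar", BarInput.class);
        System.out.println(job.runMap("foo", Arrays.asList("a", "b")));
        System.out.println(job.runMap("bar", Arrays.asList("c")));
    }
}
```

The point of the sketch is only that the per-path class selects the map logic, so no separate per-path Mapper setting is needed.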
should allow to specify different inputformat classes for different input dirs for Map/Reduce jobs
--------------------------------------------------------------------------------------------------
                Key: HADOOP-372
                URL: http://issues.apache.org/jira/browse/HADOOP-372
            Project: Hadoop
         Issue Type: New Feature
         Components: mapred
   Affects Versions: 0.4.0
        Environment: all
           Reporter: Runping Qi
        Assigned To: Owen O'Malley
Right now, the user can specify multiple input directories for a map reduce job.
However, the files under all the directories are assumed to be in the same format, with the same key/value classes. This proves to be a serious limitation in many situations.
Here is an example. Suppose I have three simple tables:
one has URLs and their rank values (page ranks),
another has URLs and their classification values,
and the third one has the URL meta data such as crawl status, last crawl time, etc.
Suppose now I need a job to generate a list of URLs to be crawled next.
The decision depends on the info in all the three tables.
Right now, there is no easy way to accomplish this.
However, this job can be done if the framework allows different input formats to be specified for different input dirs.
Suppose my three tables are in the following directories, respectively: rankTable, classificationTable, and metaDataTable.
If we extend the JobConf class with the following method (as Owen suggested to me):
addInputPath(aPath, anInputFormatClass, anInputKeyClass, anInputValueClass)
Then I can specify my job as follows:
addInputPath(rankTable, SequenceFileInputFormat.class, UTF8.class, DoubleWritable.class)
addInputPath(classificationTable, TextInputFormat.class, UTF8.class, UTF8.class)
addInputPath(metaDataTable, SequenceFileInputFormat.class, UTF8.class, MyRecord.class)
If an input directory is added through the current API, it will have the same meaning as it has now.
Thus this extension will not affect any applications that do not need this new feature.
It is relatively easy for the M/R framework to create an appropriate record reader for a map task based on the above information.
And that is the only change needed for supporting this extension.
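As a rough illustration of the bookkeeping this would need, here is a minimal sketch. PerPathInputs, InputSpec, and specFor are hypothetical names, and the format and key/value classes are empty stand-ins so the sketch compiles without Hadoop; it is not the JobConf implementation. It assumes a map task can find its settings by matching a file against the registered input directories.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the proposed extension; not the real JobConf.
final class PerPathInputs {

    // Empty stand-ins so the sketch compiles without Hadoop on the classpath.
    static final class SequenceFileInputFormat {}
    static final class TextInputFormat {}
    static final class UTF8 {}
    static final class DoubleWritable {}
    static final class MyRecord {}

    /** Input format plus key/value classes for one input directory. */
    static final class InputSpec {
        final Class<?> inputFormat;
        final Class<?> keyClass;
        final Class<?> valueClass;

        InputSpec(Class<?> inputFormat, Class<?> keyClass, Class<?> valueClass) {
            this.inputFormat = inputFormat;
            this.keyClass = keyClass;
            this.valueClass = valueClass;
        }
    }

    private final InputSpec jobDefault;
    private final Map<String, InputSpec> specs = new LinkedHashMap<>();

    PerPathInputs(InputSpec jobDefault) {
        this.jobDefault = jobDefault;
    }

    /** Current-style call: the directory uses the job-wide format and classes. */
    void addInputPath(String dir) {
        specs.put(dir, jobDefault);
    }

    /** Proposed call: the directory carries its own format and classes. */
    void addInputPath(String dir, Class<?> inputFormat, Class<?> keyClass, Class<?> valueClass) {
        specs.put(dir, new InputSpec(inputFormat, keyClass, valueClass));
    }

    /** What a map task would consult before creating a record reader for a file. */
    InputSpec specFor(String file) {
        String bestDir = null;
        for (String dir : specs.keySet()) {
            boolean under = file.equals(dir) || file.startsWith(dir + "/");
            // Prefer the longest matching directory if registrations are nested.
            if (under && (bestDir == null || dir.length() > bestDir.length())) {
                bestDir = dir;
            }
        }
        if (bestDir == null) {
            throw new IllegalArgumentException("not under any input dir: " + file);
        }
        return specs.get(bestDir);
    }

    public static void main(String[] args) {
        PerPathInputs job = new PerPathInputs(
                new InputSpec(TextInputFormat.class, UTF8.class, UTF8.class));
        job.addInputPath("rankTable", SequenceFileInputFormat.class, UTF8.class, DoubleWritable.class);
        job.addInputPath("classificationTable", TextInputFormat.class, UTF8.class, UTF8.class);
        job.addInputPath("metaDataTable", SequenceFileInputFormat.class, UTF8.class, MyRecord.class);
        job.addInputPath("legacyDir"); // added through the current-style API
        System.out.println(job.specFor("rankTable/part-00000").valueClass.getSimpleName());
        System.out.println(job.specFor("legacyDir/part-00000").inputFormat.getSimpleName());
    }
}
```

Directories added with the one-argument call fall back to the job-wide settings, which mirrors the compatibility point above.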
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
