[jira] Commented: (HADOOP-372) should allow to specify different inputformat classes for different input dirs for Map/Reduce jobs

Runping Qi (JIRA) Thu, 20 Jul 2006 12:55:17 -0700

    [ http://issues.apache.org/jira/browse/HADOOP-372?page=comments#action_12422467 ]

Runping Qi commented on HADOOP-372:
-----------------------------------



Doug,

My thought is to add a Map object to the JobConf class that keeps track of the
explicit association between an input path and its classes (input format, key,
and value). If there is no entry in the Map for a Path, then the path is
associated with the classes set through the current APIs
(setInputFormatClass(), setInputKeyClass(), setInputValueClass()), or with the
default classes if those are not set. It would be convenient if the JobConf
class provided APIs for getting/setting them:

    Class getInputFormatClassForPath(Path p)
    Class getInputKeyClassForPath(Path p)
    Class getInputValueClassForPath(Path p)
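
As a rough sketch of the lookup described above (per-path entry first, then
the job-wide classes), here is a standalone illustration. It is not the
JobConf API: PathClassRegistry, InputClasses, and setClassesForPath are
hypothetical names, and plain strings stand in for Path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Hadoop.
public class PathClassRegistry {

    /** The (input format, key, value) classes associated with one path. */
    public static final class InputClasses {
        public final Class<?> inputFormat;
        public final Class<?> key;
        public final Class<?> value;

        public InputClasses(Class<?> inputFormat, Class<?> key, Class<?> value) {
            this.inputFormat = inputFormat;
            this.key = key;
            this.value = value;
        }
    }

    // Explicit per-path associations.
    private final Map<String, InputClasses> perPath = new HashMap<>();

    // Stands in for the classes set through the current job-wide setters
    // (or the defaults, if nothing was set).
    private final InputClasses jobWide;

    public PathClassRegistry(InputClasses jobWide) {
        this.jobWide = jobWide;
    }

    public void setClassesForPath(String path, InputClasses classes) {
        perPath.put(path, classes);
    }

    // Fall back to the job-wide classes when the path has no entry.
    private InputClasses lookup(String path) {
        return perPath.getOrDefault(path, jobWide);
    }

    public Class<?> getInputFormatClassForPath(String path) {
        return lookup(path).inputFormat;
    }

    public Class<?> getInputKeyClassForPath(String path) {
        return lookup(path).key;
    }

    public Class<?> getInputValueClassForPath(String path) {
        return lookup(path).value;
    }
}
```

A path that was never registered resolves to the job-wide classes, so jobs
that use only the current setters would behave as they do today.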

It is a good idea to move the getInputKeyClass() and getInputValueClass()
methods to RecordReader (which is also more logical).
This is easy to achieve, since the implementation of the InputFormat interface
method
    getRecordReader(FileSystem fs, FileSplit split, JobConf job,
                    Reporter reporter)
has enough information to extract the key/value classes for the split.

It would also be convenient to let the Split class keep track of the input
format, key, and value classes, so that we can get the right RecordReader for
a given split.

In the MapTask class, use the following lines:
     final RecordReader rawIn =                  // open input
        split.getInputFormat().getRecordReader
        (FileSystem.get(job), split, job, reporter);
to replace
     final RecordReader rawIn =                  // open input
        job.getInputFormat().getRecordReader
        (FileSystem.get(job), split, job, reporter);
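
To illustrate why the reader should come from the split rather than the job,
here is a minimal standalone sketch. The types below are simplified,
hypothetical stand-ins and not the Hadoop interfaces; in particular,
describe() exists only to make the dispatch visible.

```java
// Hypothetical, simplified stand-ins; none of these are the Hadoop types.
public class SplitDispatchSketch {

    public interface RecordReader {
        String describe();
    }

    public interface InputFormat {
        RecordReader getRecordReader(Split split);
    }

    /** A split that remembers which input format produced it. */
    public static final class Split {
        public final String path;
        private final InputFormat inputFormat;

        public Split(String path, InputFormat inputFormat) {
            this.path = path;
            this.inputFormat = inputFormat;
        }

        public InputFormat getInputFormat() {
            return inputFormat;
        }
    }

    // Mirrors the proposed MapTask change: ask the split, not the job,
    // for the input format that opens the reader.
    public static RecordReader open(Split split) {
        return split.getInputFormat().getRecordReader(split);
    }
}
```

With this shape, two splits from different input directories in the same job
can each yield a reader of the appropriate kind.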

Finally, the getSplits() method of the InputFormatBase class should be changed
to a static method, since it is independent of any concrete InputFormat
implementation (this is more or less necessary, since the exact input format
will not be known prior to creating splits). The initTasks() method of the
JobInProgress class needs to make sure all the input format classes are loaded
properly, or leave it to the getSplits() method to take care of that.

That should cover most of the needed changes.


> should allow to specify different inputformat classes for different input 
> dirs for Map/Reduce jobs
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-372
>                 URL: http://issues.apache.org/jira/browse/HADOOP-372
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.4.0
>         Environment: all
>            Reporter: Runping Qi
>
> Right now, the user can specify multiple input directories for a map reduce 
> job. 
> However, the files under all the directories are assumed to be in the same 
> format, 
> with the same key/value classes. This proves to be a serious limitation in
> many situations.
> Here is an example. Suppose I have three simple tables: 
> one has URLs and their rank values (page ranks), 
> another has URLs and their classification values, 
> and the third one has the URL meta data such as crawl status, last crawl 
> time, etc. 
> Suppose now I need a job to generate a list of URLs to be crawled next. 
> The decision depends on the info in all the three tables.
> Right now, there is no easy way to accomplish this.
> However, this job can be done if the framework allows specifying different
> input formats for different input dirs.
> Suppose my three tables are in the following directories, respectively:
> rankTable, classificationTable, and metaDataTable.
> If we extend JobConf class with the following method (as Owen suggested to 
> me):
>     addInputPath(aPath, anInputFormatClass, anInputKeyClass, 
> anInputValueClass)
> Then I can specify my job as follows:
>     addInputPath(rankTable, SequenceFileInputFormat.class, UTF8.class, 
> DoubleWritable.class)
>     addInputPath(classificationTable, TextInputFormat.class, UTF8.class,
> UTF8.class)
>     addInputPath(metaDataTable, SequenceFileInputFormat.class, UTF8.class, 
> MyRecord.class)
> If an input directory is added through the current API, it will have the same 
> meaning as it is now. 
> Thus this extension will not affect any applications that do not need this 
> new feature.
> It is relatively easy for the M/R framework to create an appropriate record 
> reader for a map task based on the above information.
> And that is the only change needed for supporting this extension.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
