[ 
http://issues.apache.org/jira/browse/HADOOP-372?page=comments#action_12429511 ] 
            
Runping Qi commented on HADOOP-372:
-----------------------------------


The patch for HADOOP-450 laid the foundation for this issue, but the specific 
aspects of this issue are yet to be addressed, and I have started working on a 
patch for them. The patch will:
        * allow the user to specify a different input format class for each 
input directory
        * allow the user to specify a different mapper class for each 
key/value class pair

My thought is to extend the JobConf class with the following method for 
specifying different input format classes:

        public void setInputFormatClass(Class theInputFormatClass, Path p) 

The FileSplit class should be extended to have a method:
         public Class getInputFormatClass() 

The initTasks method of JobInProgress should make the necessary changes to 
create FileSplit objects carrying the correct input format class information. 
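To make the proposal concrete, here is a minimal standalone sketch of the per-path lookup the extended JobConf would need. This is not Hadoop code: the class name PerPathConf, the String paths, and the backing map are placeholders of mine; only setInputFormatClass comes from the proposal above.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a stand-in for the proposed JobConf extension,
// keeping a per-input-directory mapping from path to input format class.
public class PerPathConf {
    // maps an input directory path to the input format class to use for it
    private final Map<String, Class<?>> formatByPath = new HashMap<>();

    // the method proposed above (Path simplified to String here)
    public void setInputFormatClass(Class<?> theInputFormatClass, String path) {
        formatByPath.put(path, theInputFormatClass);
    }

    // lookup used when building a FileSplit for a file under this path;
    // returns null if the path was added without a specific format
    public Class<?> getInputFormatClass(String path) {
        return formatByPath.get(path);
    }

    public static void main(String[] args) {
        PerPathConf conf = new PerPathConf();
        conf.setInputFormatClass(String.class, "/user/data/rankTable");
        System.out.println(conf.getInputFormatClass("/user/data/rankTable").getSimpleName());
    }
}
```

With a mapping like this in the job configuration, initTasks can stamp each FileSplit with the format class of the directory it came from, which is all FileSplit.getInputFormatClass() would have to return.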
        

For supporting different mapper classes, we can extend the JobConf class with:

        public void setMapperClass(Class theMapperClass,  Class theKeyClass, 
Class theValueClass) 
        public Class getMapperClass(Class theKeyClass, Class theValueClass) 

The idea is that for each split we know the input format class; from that we 
know the corresponding record reader, and hence the key/value classes of the 
input records.
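The (key class, value class) pair lookup above could be sketched as follows. Again a standalone illustration, not Hadoop code; the pair-key encoding and class name are assumptions of mine, and only the two method signatures come from the proposal.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: resolving a mapper class from the key/value
// classes of the input records, as proposed for the extended JobConf.
public class PerTypeMapperConf {
    private final Map<String, Class<?>> mapperByTypes = new HashMap<>();

    // encode the (key class, value class) pair as a single map key
    private static String pairKey(Class<?> keyClass, Class<?> valueClass) {
        return keyClass.getName() + "|" + valueClass.getName();
    }

    public void setMapperClass(Class<?> theMapperClass, Class<?> theKeyClass,
                               Class<?> theValueClass) {
        mapperByTypes.put(pairKey(theKeyClass, theValueClass), theMapperClass);
    }

    public Class<?> getMapperClass(Class<?> theKeyClass, Class<?> theValueClass) {
        return mapperByTypes.get(pairKey(theKeyClass, theValueClass));
    }

    public static void main(String[] args) {
        PerTypeMapperConf conf = new PerTypeMapperConf();
        // Runnable stands in for a user-supplied mapper class
        conf.setMapperClass(Runnable.class, String.class, Double.class);
        System.out.println(conf.getMapperClass(String.class, Double.class).getSimpleName());
    }
}
```

At task-launch time the framework would resolve the record reader from the split's input format class, read off its key/value classes, and use a lookup like this to pick the mapper.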

Another possibility is to allow the user to specify a mapper class per input 
path, in the same way as for the input format class. To do that, the FileSplit 
class needs to support the following method:
        public Class getMapperClass()

Thoughts?
 


> should allow to specify different inputformat classes for different input 
> dirs for Map/Reduce jobs
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-372
>                 URL: http://issues.apache.org/jira/browse/HADOOP-372
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.4.0
>         Environment: all
>            Reporter: Runping Qi
>         Assigned To: Runping Qi
>
> Right now, the user can specify multiple input directories for a map reduce 
> job. 
> However, the files under all the directories are assumed to be in the same 
> format, with the same key/value classes. This proves to be a serious 
> limitation in many situations. 
> Here is an example. Suppose I have three simple tables: 
> one has URLs and their rank values (page ranks), 
> another has URLs and their classification values, 
> and the third one has the URL meta data such as crawl status, last crawl 
> time, etc. 
> Suppose now I need a job to generate a list of URLs to be crawled next. 
> The decision depends on the info in all the three tables.
> Right now, there is no easy way to accomplish this.
> However, this job can be done if the framework allows specifying different 
> inputformats for different input dirs.
> Suppose my three tables are in the following directories, respectively: 
> rankTable, classificationTable, and metaDataTable. 
> If we extend JobConf class with the following method (as Owen suggested to 
> me):
>     addInputPath(aPath, anInputFormatClass, anInputKeyClass, 
> anInputValueClass)
> Then I can specify my job as follows:
>     addInputPath(rankTable, SequenceFileInputFormat.class, UTF8.class, 
> DoubleWritable.class)
>     addInputPath(classificationTable, TextInputFormat.class, UTF8.class, 
> UTF8.class)
>     addInputPath(metaDataTable, SequenceFileInputFormat.class, UTF8.class, 
> MyRecord.class)
> If an input directory is added through the current API, it will have the same 
> meaning as it does now. 
> Thus this extension will not affect any applications that do not need this 
> new feature.
> It is relatively easy for the M/R framework to create an appropriate record 
> reader for a map task based on the above information.
> And that is the only change needed for supporting this extension.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
