[jira] Commented: (HADOOP-372) should allow to specify different inputformat classes for different input dirs for Map/Reduce jobs

Johan Oskarsson (JIRA) Fri, 20 Jun 2008 06:50:36 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606757#action_12606757
 ]


Johan Oskarsson commented on HADOOP-372:
----------------------------------------

We've unfortunately run into an issue with this patch: if we add a lot of input 
directories, we get a huge number of map tasks.
In this case almost all of the directories use the same mappers and input 
formats.

I'm guessing the issue is that our input format's getSplits is called 
separately on each directory instead of once over all of the directories. 
That means a directory containing, for example, a single 10 MB file gets 
split up into many map tasks. Normally that file would be one of many and 
would probably end up in just one map task.
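
To make the guess above concrete, here is a rough sketch of the arithmetic. It assumes a simplified model in which each getSplits-style call divides the bytes it sees by the requested number of maps to get a target split size. The class and method names, the 50 directories, and the request for 100 maps are all hypothetical, and this is not the actual Hadoop code.

```java
class SplitCountSketch {
    // Simplified model (an assumption, not the actual Hadoop code):
    // target split size = bytes seen by this call / requested maps,
    // capped at the block size and floored at a minimum split size.
    static long splitSize(long totalBytes, int requestedMaps, long blockSize, long minSize) {
        long goal = totalBytes / Math.max(1, requestedMaps);
        return Math.max(minSize, Math.min(goal, blockSize));
    }

    // Number of splits produced by ONE getSplits-style call over the given files.
    static long countSplits(long[] fileSizes, int requestedMaps, long blockSize, long minSize) {
        long total = 0;
        for (long size : fileSizes) total += size;
        long target = splitSize(total, requestedMaps, blockSize, minSize);
        long splits = 0;
        for (long size : fileSizes) splits += (size + target - 1) / target; // ceiling division
        return splits;
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024;
        long[] oneDir = {10_000_000L};            // one directory holding one ~10 MB file
        long[] allDirs = new long[50];            // 50 such directories seen in a single call
        java.util.Arrays.fill(allDirs, 10_000_000L);
        System.out.println("one call per directory: " + 50 * countSplits(oneDir, 100, block, 1));
        System.out.println("one call over all:      " + countSplits(allDirs, 100, block, 1));
    }
}
```

Under these assumed numbers, calling once per directory yields 5000 splits in total, while a single call over all 50 directories yields 100.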

Perhaps it would be possible to merge all directories using the same input 
format and mapper into one input format getSplits call?
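
A minimal sketch of the grouping step this would need, assuming each input directory is recorded together with its input format and mapper class names. InputSpec, groupByFormatAndMapper, and the example paths and class names are hypothetical and not part of the patch. Each resulting group would then be handed to a single getSplits call.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupedInputsSketch {
    // Hypothetical description of one configured input directory.
    static final class InputSpec {
        final String path;
        final String inputFormat;
        final String mapper;

        InputSpec(String path, String inputFormat, String mapper) {
            this.path = path;
            this.inputFormat = inputFormat;
            this.mapper = mapper;
        }
    }

    // Collect directories that share the same (input format, mapper) pair,
    // preserving the order in which the pairs were first seen.
    static Map<String, List<String>> groupByFormatAndMapper(List<InputSpec> inputs) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (InputSpec in : inputs) {
            String key = in.inputFormat + "|" + in.mapper;
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(in.path);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<InputSpec> inputs = new ArrayList<>();
        inputs.add(new InputSpec("/data/a", "SequenceFileInputFormat", "RankMapper"));
        inputs.add(new InputSpec("/data/b", "SequenceFileInputFormat", "RankMapper"));
        inputs.add(new InputSpec("/data/c", "TextInputFormat", "ClassMapper"));
        // One getSplits call per map entry instead of one per directory.
        groupByFormatAndMapper(inputs).forEach((key, paths) ->
                System.out.println(key + " -> " + paths));
    }
}
```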

> should allow to specify different inputformat classes for different input 
> dirs for Map/Reduce jobs
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-372
>                 URL: https://issues.apache.org/jira/browse/HADOOP-372
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.4.0
>         Environment: all
>            Reporter: Runping Qi
>            Assignee: Owen O'Malley
>         Attachments: hadoop-372.patch
>
>
> Right now, the user can specify multiple input directories for a map reduce 
> job. 
> However, the files under all the directories are assumed to be in the same 
> format, 
> with the same key/value classes. This proves to be a serious limitation in many 
> situations. 
> Here is an example. Suppose I have three simple tables: 
> one has URLs and their rank values (page ranks), 
> another has URLs and their classification values, 
> and the third one has the URL meta data such as crawl status, last crawl 
> time, etc. 
> Suppose now I need a job to generate a list of URLs to be crawled next. 
> The decision depends on the info in all the three tables.
> Right now, there is no easy way to accomplish this.
> However, this job can be done if the framework allows specifying different 
> inputformats for different input dirs.
> Suppose my three tables are in the following directories, respectively: 
> rankTable, classificationTable, and metaDataTable. 
> If we extend JobConf class with the following method (as Owen suggested to 
> me):
>     addInputPath(aPath, anInputFormatClass, anInputKeyClass, anInputValueClass)
> Then I can specify my job as follows:
>     addInputPath(rankTable, SequenceFileInputFormat.class, UTF8.class, DoubleWritable.class)
>     addInputPath(classificationTable, TextInputFormat.class, UTF8.class, UTF8.class)
>     addInputPath(metaDataTable, SequenceFileInputFormat.class, UTF8.class, MyRecord.class)
> If an input directory is added through the current API, it will have the same 
> meaning as it has now. 
> Thus this extension will not affect any applications that do not need this 
> new feature.
> It is relatively easy for the M/R framework to create an appropriate record 
> reader for a map task based on the above information.
> And that is the only change needed for supporting this extension.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
