[ https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596374#action_12596374 ]

Devaraj Das commented on HADOOP-3221:
-------------------------------------

I am tending to think that the FileSplit-based approach is the better one. The 
reasons:
1) We don't invent brand-new input formats. We reuse what exists, and the amount 
of new code is minimal (at a high level, it seems only 
FileInputFormat.getSplits and FileSplit.getLocations need to be overridden).
2) We handle large files better. Granted, with one line per map we might have 
the same problem with FileSplit, but we could work around that by choosing a 
larger N (lines per split).
3) We make no assumptions about line lengths, etc. We just make one pass 
over the files and arrive at the splits.
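
A minimal sketch of the one-pass idea in point 3, in plain Java without the Hadoop API (the class and method names here are invented for illustration; FileInputFormat.getSplits would do something analogous, recording byte offsets so each split starts on a line boundary):

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch: one pass over a text file to find the byte offset at
// which each N-line split begins. Not the actual patch; names are illustrative.
public class LineSplitSketch {

    /** Returns the start offset of each split of up to linesPerSplit lines. */
    static List<Long> splitOffsets(File file, int linesPerSplit) throws IOException {
        List<Long> starts = new ArrayList<>();
        starts.add(0L);                       // first split starts at offset 0
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            long pos = 0, lines = 0;
            boolean pending = false;          // boundary reached; waiting for next byte
            int b;
            while ((b = in.read()) != -1) {
                if (pending) {                // first byte after the N-th newline
                    starts.add(pos);          // starts a new split
                    pending = false;
                }
                pos++;
                if (b == '\n' && ++lines % linesPerSplit == 0) {
                    pending = true;           // only emit a split if more data follows
                }
            }
        }
        return starts;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("lines", ".txt");
        f.deleteOnExit();
        try (FileWriter w = new FileWriter(f)) { w.write("a\nb\nc\nd\n"); }
        System.out.println(splitOffsets(f, 2)); // prints [0, 4]
    }
}
```

Deferring the boundary until the next byte arrives avoids emitting an empty trailing split when the file ends with a newline.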

The only issue is that a couple of datanodes in the cluster might become a 
bottleneck for split serving. But that could be handled by giving such files a 
higher replication factor (just as we do for job.jar, etc.). 

Thoughts?

> Need a "LineBasedTextInputFormat"
> ---------------------------------
>
>                 Key: HADOOP-3221
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3221
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.16.2
>         Environment: All
>            Reporter: Milind Bhandarkar
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.18.0
>
>         Attachments: patch-3221-1.txt, patch-3221.txt
>
>
> In many "pleasantly" parallel applications, each process/mapper processes the 
> same input file(s), but the computations are controlled by different 
> parameters. (Referred to as "parameter sweeps".)
> One way to achieve this is to specify a set of parameters (one set per line) 
> as input in a control file (which is the input path to the map-reduce 
> application, whereas the input dataset is specified via a config variable in 
> JobConf).
> It would be great to have an InputFormat that splits the input file such 
> that, by default, one line is fed as the value to one map task, and the key 
> could be the line number; i.e., (k,v) is (LongWritable, Text).
> If the user specifies the number of maps explicitly, each mapper should get a 
> contiguous chunk of lines (so as to load-balance between the mappers).
> The location hints for the splits should not be derived from the input file, 
> but rather, should span the whole mapred cluster.
> (Is there a way to do this without having to return an array of 
> nSplits*nTaskTrackers ?)
> Increasing the replication of the "real" input dataset (since it will be 
> fetched by all the nodes) is orthogonal, and one can use DistributedCache for 
> that.
> (P.S. Please choose a better name for this InputFormat. I am not in love with 
> the "LineBasedText" name.)
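
The contiguous-chunk assignment requested in the description could look like the following plain-Java sketch (class and method names are invented for illustration; this is not the attached patch):

```java
// Hypothetical sketch: divide m lines among k mappers as contiguous, balanced
// chunks. Returns [start, end) line-index pairs, one per mapper.
public class ChunkAssignment {

    static int[][] chunks(int m, int k) {
        int[][] out = new int[k][2];
        int base = m / k, extra = m % k, start = 0;
        for (int i = 0; i < k; i++) {
            // the first m%k mappers each take one extra line
            int len = base + (i < extra ? 1 : 0);
            out[i][0] = start;
            out[i][1] = start + len;
            start += len;
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] c : chunks(10, 3)) {
            System.out.println(c[0] + ".." + c[1]); // 0..4, 4..7, 7..10
        }
    }
}
```

With this scheme no two chunk sizes differ by more than one line, which is the load-balancing property the description asks for.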

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
