[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Milind Bhandarkar (JIRA) Wed, 14 May 2008 03:38:20 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596714#action_12596714
 ]


Milind Bhandarkar commented on HADOOP-3221:
-------------------------------------------

The FileSplit approach that Devaraj describes will work, although the 
replication factor for the parameter list has to be increased to 10, as is 
done for job.jar. Each map should get exactly one line, no more, no less, so 
the file offsets in each split have to be exact in that case (not 
file-length / 80 or some similar estimate). Having exact offsets that point to 
each \n will make LineRecordReader reusable in this case, right? The unit test 
needs to cover this. The current OneLineInputFormat that Lohit built uses this 
approach, and users have been happy with it.
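
To make "exact offsets" concrete, here is a minimal sketch in plain Java of 
computing one (start, length) byte range per line of a control file. The class 
and method names are hypothetical and are not taken from the attached patches 
or from OneLineInputFormat. Turning these ranges into FileSplit objects, and 
checking how LineRecordReader treats the first line of a split that does not 
start at offset 0, is left out and is exactly what the unit test would have to 
verify.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop.
class OneLineSplitOffsets {

    // Returns one {start, length} byte range per line. The length includes
    // the trailing '\n' when the line has one. Callers reading from a file
    // would normally wrap the stream in a BufferedInputStream.
    static List<long[]> lineSplits(InputStream in) {
        List<long[]> splits = new ArrayList<>();
        long pos = 0;    // bytes consumed so far
        long start = 0;  // offset at which the current line begins
        try {
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    splits.add(new long[] {start, pos - start});
                    start = pos;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // A final line without a terminating newline still gets its own range.
        if (pos > start) {
            splits.add(new long[] {start, pos - start});
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[] control = "a=1\nb=22\nc=333".getBytes(StandardCharsets.UTF_8);
        for (long[] s : lineSplits(new ByteArrayInputStream(control))) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

This needs a full scan of the control file, which should be acceptable because 
the file holds only parameter sets, not the real input dataset.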

The same approach should work for N lines per mapper, but it will require two 
passes over the input file: the first to count the lines, the second to 
compute the splits. If the number of lines is not divisible by the number of 
mappers, it's OK to have the last mapper consume fewer lines (although dividing 
the slack among more than one mapper would be better).
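
As a sketch of the "divide the slack" variant, the arithmetic below assumes 
the first pass has already produced the total line count. It gives every 
mapper the same base number of lines and hands one extra line to each of the 
first (total mod mappers) mappers. The class and method names are hypothetical, 
and this is only one possible policy, not necessarily what the patch does.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Hadoop.
class LineChunks {

    // Returns how many lines each mapper should read. Chunk sizes differ by
    // at most one line, so the remainder is spread over several mappers
    // instead of shortening only the last one.
    static long[] linesPerMapper(long totalLines, int numMappers) {
        if (numMappers <= 0 || totalLines < 0) {
            throw new IllegalArgumentException("need numMappers > 0 and totalLines >= 0");
        }
        long base = totalLines / numMappers;   // lines every mapper gets
        long extra = totalLines % numMappers;  // leftover lines to spread out
        long[] counts = new long[numMappers];
        for (int i = 0; i < numMappers; i++) {
            counts[i] = base + (i < extra ? 1 : 0);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(linesPerMapper(10, 4)));
    }
}
```

With more mappers than lines, the trailing entries come out as 0, so a caller 
would probably want to drop those mappers rather than create empty splits.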

> Need a "LineBasedTextInputFormat"
> ---------------------------------
>
>                 Key: HADOOP-3221
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3221
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.16.2
>         Environment: All
>            Reporter: Milind Bhandarkar
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.18.0
>
>         Attachments: patch-3221-1.txt, patch-3221.txt
>
>
> In many "pleasantly" parallel applications, each process/mapper processes the 
> same input file(s), but the computations are controlled by different 
> parameters.
> (These are referred to as "parameter sweeps".)
> One way to achieve this is to specify a set of parameters (one set per line) 
> as input in a control file (which is the input path to the map-reduce 
> application, whereas the input dataset is specified via a config variable in 
> JobConf).
> It would be great to have an InputFormat that splits the input file such 
> that, by default, one line is fed as a value to one map task, and the key 
> could be the line number, i.e. (k,v) is (LongWritable, Text).
> If user specifies the number of maps explicitly, each mapper should get a 
> contiguous chunk of lines (so as to load balance between the mappers.)
> The location hints for the splits should not be derived from the input file, 
> but rather, should span the whole mapred cluster.
> (Is there a way to do this without having to return an array of 
> nSplits*nTaskTrackers ?)
> Increasing the replication of the "real" input dataset (since it will be 
> fetched by all the nodes) is orthogonal, and one can use DistributedCache for 
> that.
> (P.S. Please choose a better name for this InputFormat. I am not in love with 
> the "LineBasedText" name.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
