[jira] Commented: (HADOOP-3221) Need a "LineBasedTextInputFormat"

Chris Douglas (JIRA) Mon, 12 May 2008 15:24:20 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596219#action_12596219
 ]


Chris Douglas commented on HADOOP-3221:
---------------------------------------

bq. A pass over the input files containing the lines will tell us how many 
lines there are. The number of maps that the user desires will give us the 
number of lines per map (goalsize). The offsets in the input files can then be 
derived in a second pass over the input files (with the pass breaking at file 
boundaries just like the FileSplit case).
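
As a rough illustration of the two-pass scheme quoted above (not the patch's 
code), the sketch below counts lines in a first pass, derives the goal size 
from the requested number of maps, and then walks the files again to emit 
offsets, closing a split at every file boundary. The names TwoPassLineSplits, 
Split, and computeSplits are hypothetical. In-memory strings stand in for the 
input files, and the input is assumed to be ASCII so that character offsets 
equal byte offsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TwoPassLineSplits {
    /** Hypothetical split record: a range of one file plus its line count. */
    static final class Split {
        final String file;
        final long start;
        final long length;
        final int lines;

        Split(String file, long start, long length, int lines) {
            this.file = file;
            this.start = start;
            this.length = length;
            this.lines = lines;
        }
    }

    /** Counts '\n'-terminated lines; a trailing unterminated line also counts. */
    static int countLines(String content) {
        int n = 0;
        for (int i = 0; i < content.length(); i++) {
            if (content.charAt(i) == '\n') n++;
        }
        if (!content.isEmpty() && content.charAt(content.length() - 1) != '\n') n++;
        return n;
    }

    static List<Split> computeSplits(Map<String, String> files, int numMaps) {
        if (numMaps < 1) throw new IllegalArgumentException("numMaps must be >= 1");

        // Pass 1: total number of lines across all input files.
        long total = 0;
        for (String content : files.values()) total += countLines(content);

        // Goal size: lines per map, rounded up, never less than one.
        long goal = Math.max(1, (total + numMaps - 1) / numMaps);

        // Pass 2: derive offsets; a split never spans two files.
        List<Split> splits = new ArrayList<>();
        for (Map.Entry<String, String> e : files.entrySet()) {
            String c = e.getValue();
            int start = 0;
            int lines = 0;
            for (int i = 0; i < c.length(); i++) {
                boolean lastChar = (i == c.length() - 1);
                if (c.charAt(i) != '\n' && !lastChar) continue;
                lines++;
                if (lines == goal || lastChar) {
                    splits.add(new Split(e.getKey(), start, i + 1 - start, lines));
                    start = i + 1;
                    lines = 0;
                }
            }
        }
        return splits;
    }
}
```

Because splits break at file boundaries, the number of splits can exceed the 
requested number of maps when the files' line counts are not multiples of the 
goal size.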

For applications with one map per line of text (depressingly many, particularly 
for prototypes and research projects), the approach this patch takes makes some 
sense. For a line length of 40 to 100 characters, a FileSplit, even sans 
location information, is likely no smaller than the data it describes, so 
carrying the line itself in the split costs nothing extra. Given this potential 
advantage, at least two cases that this implementation supports work against 
that model. The first, obviously, is large files; a property defining the 
maximum aggregate file size is pretty much required to prevent accidents. The 
second is specifying the number of maps and getting splits with an equal number 
of lines. That adds little value over the default, since in practice most 
inputs will have fairly uniform line lengths; byte-based estimates should be 
very close, so the second pass has limited value. If one wants multiple lines 
per map for load balancing only, then generating splits in the usual way is 
sufficient, unless "line number as key" is a requirement and the offset isn't 
enough.
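
To make the size comparison concrete, here is a back-of-the-envelope sketch. 
It assumes a split descriptor serializes roughly as a length-prefixed path 
string followed by two longs (start and length); that layout, the class name 
SplitSizeSketch, and the example path and parameter line are all assumptions 
for illustration, not taken from the patch.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class SplitSizeSketch {
    /** Bytes needed for an assumed descriptor layout: path, start, length. */
    static int descriptorBytes(String path, long start, long length) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(path);    // 2-byte length prefix + encoded path
            out.writeLong(start);  // 8 bytes
            out.writeLong(length); // 8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    /** Bytes needed to carry the line literal itself (UTF-8). */
    static int lineBytes(String line) {
        return line.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Hypothetical control file and parameter line.
        String path = "hdfs://namenode:8020/user/alice/sweep/params.txt";
        String line = "alpha=0.10 beta=2.5 gamma=17 seed=42 iterations=1000";
        System.out.println("descriptor bytes: " + descriptorBytes(path, 0, line.length()));
        System.out.println("line bytes:       " + lineBytes(line));
    }
}
```

With a fully qualified path of a few dozen characters, the descriptor under 
this assumed layout is already about as large as a short parameter line.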

The purpose of this class would be much clearer if the user were required to 
provide N, the number of lines per split. I think it's OK to read the lines 
into the splits, as long as the total size is kept low. Ideally, this would mix 
stripped-down FileSplits with LineSplits (line literals) based on size, but 
that's probably overdoing it. It's probably sufficient to add a (starting) line 
number to LineSplit, add safety checks for maximum input size, and change its 
behavior to N lines per split. Thoughts? I think this should satisfy the 
requirements and, at least to me, it clarifies and narrows where this new 
InputFormat may be used.
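
A minimal sketch of what that could look like, under stated assumptions: the 
LineSplit here is a stand-in with only a starting line number and the line 
literals, the names NLineSplitSketch, split, and maxInputChars are 
hypothetical, and the size limit counts characters rather than bytes. It is 
not the patch's LineSplit and does not implement Hadoop's InputSplit.

```java
import java.util.ArrayList;
import java.util.List;

final class NLineSplitSketch {
    /** Stand-in for LineSplit: starting line number plus the line literals. */
    static final class LineSplit {
        final long firstLine;
        final List<String> lines;

        LineSplit(long firstLine, List<String> lines) {
            this.firstLine = firstLine;
            this.lines = lines;
        }
    }

    /**
     * Groups input lines into splits of N lines each (the last may be shorter).
     * Refuses inputs larger than maxInputChars, since every line is held in
     * memory as part of a split.
     */
    static List<LineSplit> split(Iterable<String> input, int n, long maxInputChars) {
        if (n < 1) throw new IllegalArgumentException("N must be >= 1");

        List<LineSplit> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long lineNo = 0;    // zero-based number of the line being read
        long firstLine = 0; // line number of the first line in 'current'
        long seen = 0;      // characters consumed so far, including newlines

        for (String line : input) {
            seen += line.length() + 1;
            if (seen > maxInputChars) {
                throw new IllegalStateException(
                    "input exceeds " + maxInputChars + " characters; too large to inline");
            }
            if (current.isEmpty()) firstLine = lineNo;
            current.add(line);
            lineNo++;
            if (current.size() == n) {
                splits.add(new LineSplit(firstLine, current));
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) splits.add(new LineSplit(firstLine, current));
        return splits;
    }
}
```

The starting line number lets a record reader hand out "line number as key" 
without consulting the original file, and the size check is the guard against 
pointing the format at a large input by accident.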

> Need a "LineBasedTextInputFormat"
> ---------------------------------
>
>                 Key: HADOOP-3221
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3221
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.16.2
>         Environment: All
>            Reporter: Milind Bhandarkar
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.18.0
>
>         Attachments: patch-3221-1.txt, patch-3221.txt
>
>
> In many "pleasantly" parallel applications, each process/mapper processes the 
> same input file(s), but with computations controlled by different 
> parameters.
> (Referred to as "parameter sweeps").
> One way to achieve this is to specify a set of parameters (one set per line) 
> as input in a control file (which is the input path to the map-reduce 
> application, whereas the input dataset is specified via a config variable in 
> JobConf).
> It would be great to have an InputFormat that splits the input file such 
> that by default, one line is fed as a value to one map task, and the key could 
> be the line number, i.e. (k,v) is (LongWritable, Text).
> If the user specifies the number of maps explicitly, each mapper should get a 
> contiguous chunk of lines (so as to load balance between the mappers).
> The location hints for the splits should not be derived from the input file, 
> but rather, should span the whole mapred cluster.
> (Is there a way to do this without having to return an array of 
> nSplits*nTaskTrackers ?)
> Increasing the replication of the "real" input dataset (since it will be 
> fetched by all the nodes) is orthogonal, and one can use DistributedCache for 
> that.
> (P.S. Please choose a better name for this InputFormat. I am not in love with 
> the "LineBasedText" name.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
