[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader

Mariappan Asokan (JIRA) Tue, 06 Aug 2013 11:14:13 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731036#comment-13731036
 ]


Mariappan Asokan commented on MAPREDUCE-1176:
---------------------------------------------

I was looking for an implementation of this record format as well.  I agree 
with the following comment by Todd:
{quote}
As a general note, I'm not sure I agree with the design here. Rather than 
forcing the split to lie on record boundaries, I think it would be simpler to 
simply let FileInputFormat compute its own splits, and then when you first open 
the record reader, skip forward to the next record boundary and begin reading 
from there. Then for the last record of the file, "over read" your split into 
the beginning of the next one. This is the strategy that other input formats 
take, and should be compatible with the splittable compression codecs (see 
TextInputFormat for example).
{quote}
I think we should support fixed length records spanning across HDFS blocks.

BitsOfInfo, do you mind if I pick up your patch, enhance it to take care of the 
above case, and post a patch for the trunk?

I would appreciate if a committer can come forward to review the patch and 
commit it to the trunk.

Thanks.

-- Asokan
                
> Contribution: FixedLengthInputFormat and FixedLengthRecordReader
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-1176
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1176
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 0.20.1, 0.20.2
>         Environment: Any
>            Reporter: BitsOfInfo
>         Attachments: MAPREDUCE-1176-v1.patch, MAPREDUCE-1176-v2.patch, 
> MAPREDUCE-1176-v3.patch, MAPREDUCE-1176-v4.patch
>
>
> Hello,
> I would like to contribute the following two classes for incorporation into 
> the mapreduce.lib.input package. These two classes can be used when you need 
> to read data from files containing fixed length (fixed width) records. Such 
> files have no CR/LF (or any combination thereof), no delimiters etc, but each 
> record is a fixed length, and extra data is padded with spaces. The data is 
> one gigantic line within a file.
> Provided are two classes first is the FixedLengthInputFormat and its 
> corresponding FixedLengthRecordReader. When creating a job that specifies 
> this input format, the job must have the 
> "mapreduce.input.fixedlengthinputformat.record.length" property set as follows
> myJobConf.setInt("mapreduce.input.fixedlengthinputformat.record.length",[myFixedRecordLength]);
> OR
> myJobConf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, 
> [myFixedRecordLength]);
> This input format overrides computeSplitSize() in order to ensure that 
> InputSplits do not contain any partial records since with fixed records there 
> is no way to determine where a record begins if that were to occur. Each 
> InputSplit passed to the FixedLengthRecordReader will start at the beginning 
> of a record, and the last byte in the InputSplit will be the last byte of a 
> record. The override of computeSplitSize() delegates to FileInputFormat's 
> compute method, and then adjusts the returned split size by doing the 
> following: (Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength) 
> * fixedRecordLength)
> This suite of fixed length input format classes, does not support compressed 
> files. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader

Reply via email to