[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736956#comment-13736956
 ] 

Mariappan Asokan commented on MAPREDUCE-1176:
---------------------------------------------

I went over the code for TextInputFormat.  Here are my conclusions:
* I think we can make FixedLengthInputFormat look very similar to 
TextInputFormat.  Specifically the key can be LongWritable(which indicates the 
position of the record in the file) and value can be BytesWritable(since the 
records can contain arbitrary binary data.)  Also the implementation can be 
simpler and similar to TextInputFormat.  There is no need for custom key and 
value settings.
* Splittable compressed input can be supported.
* Since the start location of each split is available(in a FileSplit object) it 
is easy to compute the number of bytes to skip at the beginning of each split.

I will proceed with the implementation and post a patch.  Also, I raised 
MAPREDUCE-5455 to support the complement namely FixedLengthOutputFormat.  With 
these, I think we can cut down some CPU time in TeraSort benchmark since the 
records are 100 bytes long and have fixed lengths.  There is no need for 
byte-by-byte scanning to identify records.
                
> Contribution: FixedLengthInputFormat and FixedLengthRecordReader
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-1176
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1176
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 0.20.1, 0.20.2
>         Environment: Any
>            Reporter: BitsOfInfo
>            Assignee: Mariappan Asokan
>         Attachments: MAPREDUCE-1176-v1.patch, MAPREDUCE-1176-v2.patch, 
> MAPREDUCE-1176-v3.patch, MAPREDUCE-1176-v4.patch
>
>
> Hello,
> I would like to contribute the following two classes for incorporation into 
> the mapreduce.lib.input package. These two classes can be used when you need 
> to read data from files containing fixed length (fixed width) records. Such 
> files have no CR/LF (or any combination thereof), no delimiters etc, but each 
> record is a fixed length, and extra data is padded with spaces. The data is 
> one gigantic line within a file.
> Provided are two classes first is the FixedLengthInputFormat and its 
> corresponding FixedLengthRecordReader. When creating a job that specifies 
> this input format, the job must have the 
> "mapreduce.input.fixedlengthinputformat.record.length" property set as follows
> myJobConf.setInt("mapreduce.input.fixedlengthinputformat.record.length",[myFixedRecordLength]);
> OR
> myJobConf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, 
> [myFixedRecordLength]);
> This input format overrides computeSplitSize() in order to ensure that 
> InputSplits do not contain any partial records since with fixed records there 
> is no way to determine where a record begins if that were to occur. Each 
> InputSplit passed to the FixedLengthRecordReader will start at the beginning 
> of a record, and the last byte in the InputSplit will be the last byte of a 
> record. The override of computeSplitSize() delegates to FileInputFormat's 
> compute method, and then adjusts the returned split size by doing the 
> following: (Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength) 
> * fixedRecordLength)
> This suite of fixed length input format classes, does not support compressed 
> files. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to