Contribution: FixedLengthInputFormat and FixedLengthRecordReader
----------------------------------------------------------------
Key: MAPREDUCE-1176
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1176
Project: Hadoop Map/Reduce
Issue Type: New Feature
Affects Versions: 0.20.1
Environment: Any
Reporter: BitsOfInfo
Priority: Minor
Hello,
I would like to contribute the following two classes for incorporation into the
mapreduce.lib.input package. These two classes can be used when you need to
read data from files containing fixed length (fixed width) records. Such files
have no CR/LF line terminators (or any combination thereof) and no delimiters;
each record is a fixed length, with any unused positions padded with spaces, so
the data is one gigantic line within the file.
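For illustration (hypothetical data): with a record length of 6, a file holding
the three records "apple", "pear", and "fig" (space-padded, with pad spaces
shown here as periods) would consist of the single 18-byte sequence:

    apple.pear..fig...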
Provided are two classes: the first is FixedLengthInputFormat, and the second
is its corresponding FixedLengthRecordReader. When creating a job that uses
this input format, the job must have the
"mapreduce.input.fixedlengthinputformat.record.length" property set, as
follows:

    myJobConf.setInt("mapreduce.input.fixedlengthinputformat.record.length",
        [myFixedRecordLength]);

OR

    myJobConf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH,
        [myFixedRecordLength]);
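For example, a minimal driver using the new (org.apache.hadoop.mapreduce) API
might look like the sketch below. The class name, the record length of 100, and
the assumption that FixedLengthInputFormat lands in mapreduce.lib.input as
proposed are all illustrative; mapper, reducer, and output key/value types are
omitted since they depend on the job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class FixedLengthDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical record length; must match the file's actual
        // fixed record width.
        conf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, 100);
        Job job = new Job(conf, "fixed-length-example");
        job.setJarByClass(FixedLengthDriver.class);
        job.setInputFormatClass(FixedLengthInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Mapper/reducer classes and output types would be set here.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }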
This input format overrides computeSplitSize() in order to ensure that
InputSplits do not contain any partial records, since with fixed-length
records there is no way to determine where a record begins if that were to
occur. Each InputSplit passed to the FixedLengthRecordReader will start at the
beginning of a record, and the last byte in the InputSplit will be the last
byte of a record. The override of computeSplitSize() delegates to
FileInputFormat's compute method, and then rounds the returned split size down
to a multiple of the fixed record length:

    (Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength)
        * fixedRecordLength)
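A minimal sketch of what that override could look like, assuming the configured
record length has already been read from the job configuration into a
recordLength field (long division supplies the Math.floor):

    @Override
    protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
      // Delegate to FileInputFormat's default split size computation.
      long defaultSplitSize = super.computeSplitSize(blockSize, minSize, maxSize);
      // Round down to the nearest multiple of the record length so a split
      // never ends in the middle of a record; integer division floors.
      return (defaultSplitSize / recordLength) * recordLength;
    }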
This suite of fixed-length input format classes does not support compressed
files.