[
https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731149#comment-13731149
]
BitsOfInfo commented on MAPREDUCE-1176:
---------------------------------------
Asokan: Sure, go ahead and make whatever changes are necessary; I no longer have
time to work on this, but I would still like to see it make it into the project,
as I had a use for it when I created it and I'm sure others do as well.
BTW: my original question from a few years ago about the "design" was never
answered; maybe I was missing something.
bq. "Hmm, ok, do you have suggestion on how I detect where one record begins
and one record ends when records are not identifiable by any sort of consistent
"start" character or "end" character "boundary" but just flow together? I could
see the RecordReader detecting that it only read < RECORD LENGTH bytes and
hitting the end of the split and discarding it. But I am not sure how it would
detect the start of a record, with a split that has partial data at the start
of it. Especially if there is no consistent boundary/char marker that
identifies the start of a record."
> Contribution: FixedLengthInputFormat and FixedLengthRecordReader
> ----------------------------------------------------------------
>
> Key: MAPREDUCE-1176
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1176
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Affects Versions: 0.20.1, 0.20.2
> Environment: Any
> Reporter: BitsOfInfo
> Attachments: MAPREDUCE-1176-v1.patch, MAPREDUCE-1176-v2.patch,
> MAPREDUCE-1176-v3.patch, MAPREDUCE-1176-v4.patch
>
>
> Hello,
> I would like to contribute the following two classes for incorporation into
> the mapreduce.lib.input package. These two classes can be used when you need
> to read data from files containing fixed-length (fixed-width) records. Such
> files have no CR/LF (or any combination thereof) and no delimiters; each
> record is a fixed length, with any unused space padded with spaces, so the
> data is effectively one gigantic line within the file.
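> For illustration (the record length and contents below are made up): with a
> record length of 10 bytes, a file holding three records is simply 30
> contiguous bytes, e.g. "JOHN  1234ALICE 5678BOB   9012".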
> Provided are two classes: FixedLengthInputFormat and its corresponding
> FixedLengthRecordReader. When creating a job that uses this input format, the
> job must have the "mapreduce.input.fixedlengthinputformat.record.length"
> property set, for example:
> myJobConf.setInt("mapreduce.input.fixedlengthinputformat.record.length", [myFixedRecordLength]);
> OR
> myJobConf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, [myFixedRecordLength]);
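> A minimal driver sketch (assuming the contributed classes end up in
> org.apache.hadoop.mapreduce.lib.input as proposed; the class name
> FixedWidthDriver and the 100-byte record length are purely illustrative, and
> the mapper/reducer setup is omitted):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
> public class FixedWidthDriver {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     // Illustrative: every record in the input files is exactly 100 bytes wide.
>     conf.setInt("mapreduce.input.fixedlengthinputformat.record.length", 100);
>
>     Job job = new Job(conf, "fixed-width-read");
>     job.setJarByClass(FixedWidthDriver.class);
>     job.setInputFormatClass(FixedLengthInputFormat.class);
>     job.setNumReduceTasks(0); // map-only pass-through for this sketch
>     FileInputFormat.addInputPath(job, new Path(args[0]));
>     FileOutputFormat.setOutputPath(job, new Path(args[1]));
>     System.exit(job.waitForCompletion(true) ? 0 : 1);
>   }
> }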
> This input format overrides computeSplitSize() in order to ensure that
> InputSplits do not contain any partial records since with fixed records there
> is no way to determine where a record begins if that were to occur. Each
> InputSplit passed to the FixedLengthRecordReader will start at the beginning
> of a record, and the last byte in the InputSplit will be the last byte of a
> record. The override of computeSplitSize() delegates to FileInputFormat's
> computeSplitSize() implementation, and then rounds the returned split size
> down to a multiple of the record length:
> (Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength) * fixedRecordLength)
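> A rough sketch of that override (the fixedRecordLength field is assumed to be
> read from the job configuration elsewhere; only the rounding logic reflects
> the description above):
>
> @Override
> protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
>   long defaultSize = super.computeSplitSize(blockSize, minSize, maxSize);
>   // Integer division floors the result, so the split size becomes the largest
>   // multiple of the record length that does not exceed the default size.
>   return (defaultSize / fixedRecordLength) * fixedRecordLength;
> }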
> This suite of fixed-length input format classes does not support compressed
> files.