[ https://issues.apache.org/jira/browse/MAPREDUCE-6598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083086#comment-15083086 ]
Jason Lowe commented on MAPREDUCE-6598:
---------------------------------------
LineReader already supports a custom record delimiter. There are a number of
constructors that take a byte array specifying the record delimiter bytes.
This in turn is also supported by LineRecordReader which internally uses
LineReader.
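
For illustration, here is a minimal, dependency-free sketch of what
delimiter-based record splitting looks like. It is not Hadoop's implementation:
the class and method names (DelimitedRecordSketch, readRecords) are
hypothetical, and the sketch reads the whole stream into memory, which the real
reader does not need to do. In Hadoop itself the equivalent entry points are
the LineReader(InputStream, byte[]) constructor and, for TextInputFormat, the
textinputformat.record.delimiter configuration property.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Hadoop.
class DelimitedRecordSketch {

    // Splits the stream into records separated by the given delimiter bytes.
    // The delimiter itself is dropped from each record in this sketch.
    static List<String> readRecords(InputStream in, byte[] delimiter) {
        if (delimiter.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        byte[] data = readFully(in);
        List<String> records = new ArrayList<>();
        int start = 0; // first byte of the record currently being scanned
        int i = 0;
        while (i <= data.length - delimiter.length) {
            if (matchesAt(data, i, delimiter)) {
                records.add(new String(data, start, i - start, StandardCharsets.UTF_8));
                i += delimiter.length;
                start = i;
            } else {
                i++;
            }
        }
        // Trailing bytes with no closing delimiter form a final record.
        if (start < data.length) {
            records.add(new String(data, start, data.length - start, StandardCharsets.UTF_8));
        }
        return records;
    }

    // True if the delimiter occurs in data starting at the given offset.
    private static boolean matchesAt(byte[] data, int offset, byte[] delimiter) {
        for (int j = 0; j < delimiter.length; j++) {
            if (data[offset + j] != delimiter[j]) {
                return false;
            }
        }
        return true;
    }

    // Simplification for the sketch: buffer the entire stream in memory.
    private static byte[] readFully(InputStream in) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<msg>\n  <id>1</id>\n</msg>\n<msg>\n  <id>2</id>\n</msg>\n";
        byte[] delimiter = "</msg>\n".getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        // Each record spans several lines; the closing tag acts as the delimiter.
        for (String record : readRecords(in, delimiter)) {
            System.out.println("record: " + record.replace("\n", "\\n"));
        }
    }
}
```

With a closing tag such as "</msg>\n" as the delimiter, each multi-line XML
message comes back as one record, so a custom RecordReader may not be needed
for this use case.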
> LineReader enhancement to support text records containing "\n"
> --------------------------------------------------------------
>
> Key: MAPREDUCE-6598
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6598
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mrv2
> Affects Versions: 2.6.0
> Environment: RHEL 7, Spark 1.3.1, Hadoop 2.6.0
> Reporter: cloudyarea
> Priority: Minor
>
> We have billions of XML message records stored in text files that need to be
> parsed in parallel by Spark. By default, Spark opens a Hadoop text file using
> LineReader, which provides a single line of text as a record.
> The XML messages contain "\n", and I believe this is a common scenario: many
> users have records that span multiple lines. Currently, the solution is to
> extend the RecordReader interface.
> To reduce the repeated work, I wrote a class named MessageRecordReader that
> extends the RecordReader interface. The user can set a string as the record
> delimiter, and MessageRecordReader then provides a multi-line record to the
> user.
> I would like to contribute the code to the community. Please let me know if
> you are interested in this simple but useful implementation.
> Thank you very much and happy new year!
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)