cloudyarea created MAPREDUCE-6598:
-------------------------------------
Summary: LineReader enhancement to support text records containing
"\n"
Key: MAPREDUCE-6598
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6598
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: mrv2
Affects Versions: 2.6.0
Environment: RHEL 7, Spark 1.3.1, Hadoop 2.6.0
Reporter: cloudyarea
Priority: Minor
We have billions of XML message records stored in text files that need to be
parsed in parallel by Spark. By default, Spark opens a Hadoop text file using
LineReader, which provides a single line of text as a record.
The XML messages contain "\n", and I believe this is a common scenario: many
users have records that span multiple lines. Currently, the solution is to
implement the RecordReader interface yourself.
To avoid this repeated work, I wrote a class named MessageRecordReader that
implements the RecordReader interface. The user sets a string as the record
delimiter, and MessageRecordReader then returns each multi-line record to the user.
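To illustrate the idea of delimiter-based records, here is a minimal sketch in plain Java. It is not the MessageRecordReader code from this issue and it does not use the Hadoop API. The class name DelimitedRecordSketch and the method readRecords are hypothetical names chosen for this example. It reads characters from a stream and emits a record each time the configured delimiter string is seen, so a record may contain "\n". A real RecordReader would also have to handle records that cross input split boundaries, which this sketch ignores.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the MessageRecordReader from this issue.
class DelimitedRecordSketch {

    /**
     * Reads the whole input and splits it into records at each occurrence
     * of the delimiter. The delimiter itself is not included in the records.
     */
    static List<String> readRecords(Reader in, String delimiter) throws IOException {
        if (delimiter == null || delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must be non-empty");
        }
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            current.append((char) c);
            // After each character, check whether the buffer now ends with the delimiter.
            if (endsWith(current, delimiter)) {
                current.setLength(current.length() - delimiter.length());
                records.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep a trailing record that is not terminated by the delimiter.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    private static boolean endsWith(StringBuilder sb, String suffix) {
        int offset = sb.length() - suffix.length();
        if (offset < 0) {
            return false;
        }
        for (int i = 0; i < suffix.length(); i++) {
            if (sb.charAt(offset + i) != suffix.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Two XML messages, each spanning several lines.
        String input = "<msg>\n  <id>1</id>\n</msg>\n<msg>\n  <id>2</id>\n</msg>\n";
        List<String> records = readRecords(new StringReader(input), "</msg>\n");
        for (String r : records) {
            // Show embedded newlines as a visible escape sequence.
            System.out.println("record: " + r.replace("\n", "\\n"));
        }
    }
}
```

In this sketch the delimiter is stripped from each record. For XML messages, where the closing tag is part of the record, a reader might instead keep the delimiter text; that is a design choice this example leaves open.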
I would like to contribute the code to the community. Please let me know if you
are interested in this simple but useful implementation.
Thank you very much and happy new year!
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)