[
https://issues.apache.org/jira/browse/MAPREDUCE-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869600#comment-13869600
]
Michele Giusto commented on MAPREDUCE-2254:
-------------------------------------------
Hi everybody, I believe there is a bug when the custom record delimiter is
longer than 1 character. For example, suppose the delimiter is "#$&" and the
line "...record1#$&record2#$&record3#$&..." is divided between 2 consecutive
input splits, with the second input split beginning after the first "$" (so it
starts with "&record2#$&record3#$&..."). In that case "record2" will not be
read.
This happens because the mapper that processes the second split starts
reading from the last character of the first input split (the "$"), so it
loses the delimiter between "record1" and "record2". As a result, the
constructor of the mapper tries to skip the last line of the previous input
split but instead skips the first line of its own split and reports "record3"
as the first line.
If you agree that this is a bug, a possible solution may be to modify the
LineRecordReader class so that each input split (except the first one) starts
reading not from the last character of the previous input split but from a
position further back by a number of characters equal to the length of the
record delimiter (3 in my example).
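To make the scenario above easier to check, here is a minimal Python sketch of
the behaviour I am describing. It is only a toy model under my own
assumptions, not the actual LineRecordReader code: the names read_split,
read_all and back_up are hypothetical, and the reader is assumed to emit every
record that starts before the end of its split. With back_up=1 the second
split steps back one character and "record2" is lost; with back_up equal to
the delimiter length all three records are read.

```python
def read_split(data, start, end, delimiter, back_up):
    """Toy model: return the records emitted for the split [start, end)."""
    pos = start
    if start != 0:
        # Step back by `back_up` characters, then discard everything up to
        # and including the first delimiter found from that position.
        pos = max(0, start - back_up)
        idx = data.find(delimiter, pos)
        if idx == -1:
            return []
        pos = idx + len(delimiter)

    records = []
    # Emit every record that starts before the end of this split.
    while pos < end and pos < len(data):
        idx = data.find(delimiter, pos)
        if idx == -1:
            records.append(data[pos:])
            break
        records.append(data[pos:idx])
        pos = idx + len(delimiter)
    return records


def read_all(data, boundary, delimiter, back_up):
    """Read two consecutive splits, [0, boundary) and [boundary, len(data))."""
    first = read_split(data, 0, boundary, delimiter, back_up)
    second = read_split(data, boundary, len(data), delimiter, back_up)
    return first + second


DATA = "record1#$&record2#$&record3#$&"
DELIM = "#$&"
# The second split begins right after the first "$", i.e. with "&record2...".
BOUNDARY = DATA.index("$") + 1

# Stepping back a single character: "record2" is skipped.
print(read_all(DATA, BOUNDARY, DELIM, back_up=1))
# ['record1', 'record3']

# Stepping back by the delimiter length: every record is read once.
print(read_all(DATA, BOUNDARY, DELIM, back_up=len(DELIM)))
# ['record1', 'record2', 'record3']
```

In this model the only change between the two runs is how far the second
split steps back before it discards its first partial line.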
> Allow setting of end-of-record delimiter for TextInputFormat
> ------------------------------------------------------------
>
> Key: MAPREDUCE-2254
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2254
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: Ahmed Radwan
> Assignee: Ahmed Radwan
> Fix For: 0.23.0
>
> Attachments: MAPREDUCE-2245.patch, MAPREDUCE-2254_r2.patch,
> MAPREDUCE-2254_r3.patch
>
>
> It will be useful to allow setting the end-of-record delimiter for
> TextInputFormat. The current implementation hardcodes '\n', '\r' or '\r\n' as
> the only possible record delimiters. This is a problem if users have embedded
> newlines in their data fields (which is pretty common). This is also a
> problem for other tools using this TextInputFormat (See for example:
> https://issues.apache.org/jira/browse/PIG-836 and
> https://issues.cloudera.org/browse/SQOOP-136).
> I have written a patch to address this issue. This patch allows users to
> specify any custom end-of-record delimiter using a newly added configuration
> property. For backward compatibility, if this new configuration property is
> absent, then the exact same previous delimiters are used (i.e., '\n', '\r' or
> '\r\n').