[jira] [Commented] (MAPREDUCE-2254) Allow setting of end-of-record delimiter for TextInputFormat

Michele Giusto (JIRA) Mon, 13 Jan 2014 08:11:02 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869600#comment-13869600
 ]


Michele Giusto commented on MAPREDUCE-2254:
-------------------------------------------

Hi everybody, I believe there is a bug when the custom record delimiter is 
longer than one character. For example, take the delimiter "#$&" and the 
data "...record1#$&record2#$&record3#$&..." divided between two consecutive 
input splits, with the second split beginning after the first "$" (so it 
starts with "&record2#$&record3#$&..."). In that case "record2" will not be 
read.
This happens because the mapper that processes the second split starts 
reading from the last character of the first input split (the "$"), so it 
misses the delimiter between "record1" and "record2". As a result, the 
constructor of the mapper, which tries to skip the last line of the previous 
input split, instead skips the first line of its own split and reports 
"record3" as the first line. 
If you agree that this is a bug, a possible solution may be to modify the 
LineRecordReader class so that each input split (except the first one) starts 
reading not from the last character of the previous input split but from a 
position a number of characters further back, equal to the length of the 
record delimiter (3 in my example). 
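
To make the scenario concrete, here is a minimal, self-contained sketch that 
simulates the behaviour described above on an in-memory string. It is not the 
Hadoop LineRecordReader code: the class SplitBoundaryDemo, the method 
readSplit and the backOff parameter are hypothetical names used only for 
illustration, and the sketch assumes that a reader for a non-first split backs 
up by backOff characters, discards everything up to and including the next 
delimiter, and then reads records while its position is before the split end.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitBoundaryDemo {

    // Simulates one reader over the half-open split [splitStart, splitEnd).
    // backOff is how many characters a non-first split steps back before
    // discarding its (assumed partial) first record.
    static List<String> readSplit(String data, String delim,
                                  int splitStart, int splitEnd, int backOff) {
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        if (splitStart != 0) {
            pos = Math.max(0, splitStart - backOff);
            // Skip everything up to and including the next full delimiter.
            int hit = data.indexOf(delim, pos);
            pos = (hit < 0) ? data.length() : hit + delim.length();
        }
        // Read every record that starts before the end of this split.
        while (pos < splitEnd && pos < data.length()) {
            int hit = data.indexOf(delim, pos);
            int recordEnd = (hit < 0) ? data.length() : hit;
            records.add(data.substring(pos, recordEnd));
            pos = (hit < 0) ? data.length() : hit + delim.length();
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "record1#$&record2#$&record3#$&";
        String delim = "#$&";
        int boundary = 9; // second split starts with "&record2#$&..."

        // First split: reads record1 and stops, since record2 starts past the boundary.
        System.out.println(readSplit(data, delim, 0, boundary, 1));             // [record1]
        // Backing up one character starts at "$", so the first delimiter is missed.
        System.out.println(readSplit(data, delim, boundary, data.length(), 1)); // [record3]
        // Backing up by the delimiter length sees the whole "#$&".
        System.out.println(readSplit(data, delim, boundary, data.length(),
                delim.length()));                                               // [record2, record3]
    }
}
```

With a back-off of one character neither split emits "record2"; with a 
back-off equal to the delimiter length the second split emits it. This only 
checks the example from this comment and is not a proof that the change is 
correct for every split position.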

> Allow setting of end-of-record delimiter for TextInputFormat
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-2254
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2254
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Ahmed Radwan
>            Assignee: Ahmed Radwan
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-2245.patch, MAPREDUCE-2254_r2.patch, 
> MAPREDUCE-2254_r3.patch
>
>
> It will be useful to allow setting the end-of-record delimiter for 
> TextInputFormat. The current implementation hardcodes '\n', '\r' or '\r\n' as 
> the only possible record delimiters. This is a problem if users have embedded 
> newlines in their data fields (which is pretty common). This is also a 
> problem for other tools using this TextInputFormat (See for example: 
> https://issues.apache.org/jira/browse/PIG-836 and 
> https://issues.cloudera.org/browse/SQOOP-136).
> I have written a patch to address this issue. This patch allows users to 
> specify any custom end-of-record delimiter using a newly added configuration 
> property. For backward compatibility, if this new configuration property is 
> absent, then exactly the same delimiters as before are used (i.e., '\n', 
> '\r' or '\r\n').



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
