[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2014-06-27 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046138#comment-14046138
 ] 

Jason Lowe commented on HADOOP-9867:


Actually I agree with Rushabh that there are at least two somewhat different 
problems here.  The original problem reported in the JIRA has to do with 
records being dropped with uncompressed inputs.  We should fix that issue so we 
don't drop data when using an uncompressed input.  I'm assuming Rushabh's patch 
solves that issue, but I haven't looked at it in detail just yet.

There's another issue related to mistaken record delimiter recognition, where 
the subsequent split reader can accidentally think it found a delimiter when in 
fact the real record delimiter is somewhere else. With a delimiter of 'xxx', if 
the subsequent split reader sees 'xxxxyzxxx' at the beginning of its split then 
it will toss out the first record (i.e.: the first 'xxx') and then read 'xyz' 
as the next record.  However that may or may not be the correct behavior, 
because with that kind of delimiter and data the correct behavior depends upon 
the _previous_ split's data.  If the previous split ended with 'abc' then the 
behavior was correct and there are two records in the stream: 'abc' and 'xyz'.  
If the previous split ended with 'abcx' then that's the incorrect behavior: the 
records should be 'abc' and 'xxyz', but the second split reader will report an 
'xyz' record that shouldn't exist.
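
To make the ambiguity concrete, here's a small self-contained sketch 
(hypothetical data and helper, not from any patch on this JIRA) showing that 
two different streams hand the second split reader identical bytes, so no 
amount of local inspection can recover the right records:

{code:title=DelimiterAmbiguity.java}
// Illustrates why the bytes of the second split alone are ambiguous when the
// delimiter is 'xxx'. Both streams below give the second split the identical
// prefix "xxxxyzxxx", yet the correct records differ.
public class DelimiterAmbiguity {
  static void show(String stream, int splitStart) {
    String delim = "xxx";
    String secondSplit = stream.substring(splitStart);
    // Naive rule: discard everything up to and including the first delimiter,
    // then emit the next record.
    int first = secondSplit.indexOf(delim);
    int recStart = first + delim.length();
    int recEnd = secondSplit.indexOf(delim, recStart);
    System.out.println("second split sees '" + secondSplit + "', emits '"
        + secondSplit.substring(recStart, recEnd) + "'");
  }

  public static void main(String[] args) {
    // Records 'abc' and 'xyz'; split falls right after 'abc': emitting 'xyz'
    // is correct.
    show("abc" + "xxx" + "xyz" + "xxx", 3);
    // Records 'abc' and 'xxyz'; split falls after 'abcx': the reader still
    // emits 'xyz', but the real second record is 'xxyz'.
    show("abc" + "xxx" + "xxyz" + "xxx", 4);
  }
}
{code}

Both calls print the same second-split bytes and the same emitted record, even 
though the correct second record differs between the two streams.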

To solve that problem either a split reader would have to examine the 
prior split's data to distinguish this case, or the split reader would have to 
realize it's an ambiguous situation and leave the record processing to the 
previous split reader to handle.  The former can be very expensive if the prior 
split is compressed, as it potentially has to unpack the entire split.  This 
can also get very tricky, and a reader may need to read more than one other 
split to resolve it.  For example, if the data stream is 
'ax..xxbxx..xcxx' then a reader may have to 
scan far down into subsequent splits, since only it knows where the true record 
boundaries are.  Simply tacking on an extra character at the beginning of that 
input changes where the record boundaries are and the record contents, even in 
the last split of the input.  Solving this requires a different high-level 
algorithm for split processing than what we have today (i.e.: throw away the 
first record and go), so I believe that's something better left to a followup 
JIRA.

It'd be nice to solve the dropped-record problem for scenarios where we don't 
have to worry about mistaken record delimiter recognition in the data, as 
that's an incremental improvement from where we are today.  I'll try to get 
some time to review the latest patch and provide comments soon.

 org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record
 delimiters well
 --

             Key: HADOOP-9867
             URL: https://issues.apache.org/jira/browse/HADOOP-9867
         Project: Hadoop Common
      Issue Type: Bug
      Components: io
Affects Versions: 0.20.2, 0.23.9, 2.2.0
     Environment: CDH3U2, Red Hat Linux 5.7
        Reporter: Kris Geusebroek
        Assignee: Vinayakumar B
        Priority: Critical
     Attachments: HADOOP-9867.patch, HADOOP-9867.patch, HADOOP-9867.patch,
                  HADOOP-9867.patch

 Having defined a record delimiter of multiple bytes in a new InputFileFormat
 sometimes has the effect of skipping records from the input.
 This happens when the input splits are split off just after a record
 separator. The starting point for the next split would be non-zero and
 skipFirstLine would be true. A seek into the file is done to start - 1 and
 the text until the first record delimiter is ignored (due to the presumption
 that this record is already handled by the previous map task). Since the
 record delimiter is multibyte, the seek only got the last byte of the
 delimiter into scope and it's not recognized as a full delimiter. So the text
 is skipped until the next delimiter (ignoring a full record!!)





[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2014-06-26 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044499#comment-14044499
 ] 

Vinayakumar B commented on HADOOP-9867:
---

Thanks [~shahrs87] for trying out the patch.

I got a test failure when the input string specified in your test is as 
follows, with separator "xxx" and split length 46.
{code}
String inputData = "abcxxxdefxxxghixxx"
    + "jklxxxmnoxxxpqrxxxstuxxxvw yz";
{code}

Can you check again?
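
For reference, here is a hedged sketch of how such a check could be driven 
end-to-end (the temp path and the three-argument LineRecordReader constructor 
are my assumptions, not something taken from the patch):

{code:title=CustomDelimiterRepro.java}
// Write the quoted string to a local file, read it back as splits of length
// 46 with delimiter "xxx", and print the records that come out. A dropped
// record shows up as a missing entry in the printed list.
import java.io.File;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.LineRecordReader;

public class CustomDelimiterRepro {
  public static void main(String[] args) throws Exception {
    String inputData = "abcxxxdefxxxghixxx"
        + "jklxxxmnoxxxpqrxxxstuxxxvw yz";
    File f = new File("/tmp/customdelim.txt");
    try (FileOutputStream out = new FileOutputStream(f)) {
      out.write(inputData.getBytes("UTF-8"));
    }
    Configuration conf = new Configuration();
    long splitLen = 46;
    List<String> records = new ArrayList<String>();
    for (long start = 0; start < inputData.length(); start += splitLen) {
      long len = Math.min(splitLen, inputData.length() - start);
      FileSplit split =
          new FileSplit(new Path(f.toURI()), start, len, (String[]) null);
      LineRecordReader reader =
          new LineRecordReader(conf, split, "xxx".getBytes("UTF-8"));
      LongWritable key = reader.createKey();
      Text value = reader.createValue();
      while (reader.next(key, value)) {
        records.add(value.toString());  // each record between delimiters
      }
      reader.close();
    }
    System.out.println(records);
  }
}
{code}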



[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2014-06-26 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044681#comment-14044681
 ] 

Rushabh S Shah commented on HADOOP-9867:


Hey Vinayakumar,
Thanks for checking out the patch and providing valuable feedback.
I did run into this test case while solving this JIRA.
I am going to file another JIRA for this specific test case (and a couple more 
which I came across), since the test case you mentioned is not in the scope 
of this JIRA.



[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2014-06-26 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044691#comment-14044691
 ] 

Vinayakumar B commented on HADOOP-9867:
---

I feel this case is related to this JIRA also.
Refer to the example given by Jason in one of the above comments.



[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2014-06-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043559#comment-14043559
 ] 

Hadoop QA commented on HADOOP-9867:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12652422/HADOOP-9867.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/4168//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/4168//console

This message is automatically generated.



[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2014-02-26 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913017#comment-13913017
 ] 

Vinayakumar B commented on HADOOP-9867:
---

Hi Jason,
I was trying to implement the proposed solution you suggested, but I was 
facing issues.
If you know the exact changes, can you please provide a patch?
Thanks



[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2013-12-10 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844428#comment-13844428
 ] 

Jason Lowe commented on HADOOP-9867:


Thanks for updating the patch, Vinay.  Comments:

* I don't think LineReader is the best place to put split-specific code.  Its 
sole purpose is to read lines from an input stream regardless of split 
boundaries.  There are users of this class that are not necessarily processing 
splits.  That's why I created SplitLineReader in MapReduce, and I believe this 
logic is better placed there.
* I don't think we want to change Math.max(maxBytesToConsume(pos), 
maxLineLength) to Math.min(maxBytesToConsume(pos), maxLineLength).  We need 
to be able to read a record past the end of the split when the record crosses 
the split boundary, but I think this change could allow a truncated record to 
be returned for an uncompressed input stream, e.g.: fillBuffer happens to 
return data only up to the end of the split, the record is incomplete (no 
delimiter found), but maxBytesToConsume keeps us from filling the buffer with 
more data and a truncated record is returned.

I think a more straightforward approach would be to have SplitLineReader be 
aware of the end of the split and track it in fillBuffer() much like 
CompressedSplitLineReader does.  The fillBuffer callback already indicates 
whether we're mid-delimiter or not, so we can simply check whether fillBuffer 
is being called after the split has ended while we're mid-delimiter.  In that 
case we need an additional record.
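
As an illustration, a minimal sketch of that approach could look like the 
following (class and field names are mine, and this is deliberately simplified 
rather than the committed implementation):

{code:title=UncompressedSplitLineReaderSketch.java}
// Sketch of the approach described above: track how much of the split has
// been consumed, and if fillBuffer() is invoked past the end of the split
// while we are mid-delimiter, the divided delimiter's record belongs to
// this split's reader.
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.SplitLineReader;

public class UncompressedSplitLineReaderSketch extends SplitLineReader {
  private final long splitLength;   // bytes belonging to this split
  private long bytesConsumed = 0;   // bytes read from the stream so far
  private boolean needAdditionalRecord = false;

  public UncompressedSplitLineReaderSketch(InputStream in, Configuration conf,
      byte[] recordDelimiterBytes, long splitLength) throws IOException {
    super(in, conf, recordDelimiterBytes);
    this.splitLength = splitLength;
  }

  @Override
  protected int fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter)
      throws IOException {
    // Being asked for more bytes after the split has ended while still in
    // the middle of a delimiter means the delimiter straddles the split
    // boundary, so this reader must also consume the following record.
    if (inDelimiter && bytesConsumed >= splitLength) {
      needAdditionalRecord = true;
    }
    int read = super.fillBuffer(in, buffer, inDelimiter);
    if (read > 0) {
      bytesConsumed += read;
    }
    return read;
  }

  @Override
  public boolean needAdditionalRecordAfterSplit() {
    return needAdditionalRecord;
  }
}
{code}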



[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2013-12-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843911#comment-13843911
 ] 

Hadoop QA commented on HADOOP-9867:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12617969/HADOOP-9867.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3350//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3350//console

This message is automatically generated.



[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2013-11-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13827567#comment-13827567
 ] 

Hadoop QA commented on HADOOP-9867:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12614864/HADOOP-9867.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

  org.apache.hadoop.mapred.TestJobCleanup

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3302//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/3302//console

This message is automatically generated.



[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2013-11-20 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13827795#comment-13827795
 ] 

Jason Lowe commented on HADOOP-9867:


Thanks for the patch, Vinay.  I think this approach can work when the input is 
uncompressed; however, I don't think it will work for block-compressed inputs.  
Block codecs often report the file position as the start of the codec block, 
and that position then teleports to the byte position of the next block once 
the first byte of the next block is consumed.  See HADOOP-9622 for a similar 
issue with the default delimiter and how it's being addressed.  Also, 
getFilePosition() for a compressed input returns a compressed stream offset, 
so if we try to do math on that with an uncompressed delimiter length we're 
mixing different units.

Since LineRecordReader::getFilePosition() can mean different things for 
different inputs, I think a better approach would be to change LineReader (not 
LineRecordReader) so the reported file position for multi-byte custom 
delimiters is the file position after the record but not including its 
delimiter.  Either that, or wait for HADOOP-9622 to be committed and update 
the SplitLineReader interface from the HADOOP-9622 patch so the uncompressed 
input reader would indicate that an additional record needs to be read if the 
split ends mid-delimiter.



[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2013-11-20 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828453#comment-13828453
 ] 

Vinay commented on HADOOP-9867:
---

Thanks Jason, I prefer waiting for HADOOP-9622 to be committed. 
Meanwhile I will try to update SplitLineReader offline. 



[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2013-11-19 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826729#comment-13826729
 ] 

Jason Lowe commented on HADOOP-9867:


Ran across this JIRA while discussing the intricacies of HADOOP-9622.  There's 
a relatively straightforward testcase that demonstrates the issue.  With the 
following plaintext input

{code:title=customdeliminput.txt}
abcxxx
defxxx
ghixxx
jklxxx
mnoxxx
pqrxxx
stuxxx
vw xxx
xyzxxx
{code}

run a wordcount job like this:

{noformat}
hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar 
wordcount -Dmapreduce.input.fileinputformat.split.maxsize=33 
-Dtextinputformat.record.delimiter=xxx customdeliminput.txt wcout
{noformat}

and we can see that one of the records was dropped due to incorrect split 
processing:

{noformat}
$ hadoop fs -cat wcout/part-r-0   
abc 1
def 1
ghi 1
jkl 1
mno 1
stu 1
vw  1
xyz 1
{noformat}

I don't think rewinding the seek position by the delimiter length is correct in 
all cases.  I believe that will lead to duplicate records rather than dropped 
records (e.g.: split ends exactly when a delimiter ends, and both splits end up 
processing the record after that delimiter).
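
For a concrete (hypothetical) trace of that duplicate case: take records 'abc' 
and 'def' with delimiter 'xxx', so the stream is 'abcxxxdefxxx', and put the 
split boundary at byte 6, exactly where the first delimiter ends:

{noformat}
offset:  0 1 2 3 4 5 6 7 8 9 10 11
byte:    a b c x x x d e f x x  x     split 1 = [0,6), split 2 = [6,12)
split 1: reads 'abc'; its position (6) is not past the split end, so it also
         reads 'def' (a reader keeps going until it has passed the boundary)
split 2: rewinds by the delimiter length to offset 3, discards up to the
         delimiter at offsets 3..5, then reads 'def' again, a duplicate
{noformat}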

Instead we can get correct behavior by treating any split in the middle of a 
multibyte custom delimiter as if the delimiter ended exactly at the end of the 
split, i.e.: the consumer of the prior split is responsible for processing the 
divided delimiter and the subsequent record.  The consumer of the next split 
then tosses the first record up to the first full delimiter as usual (i.e.: 
including the partial delimiter at the beginning of the split) and proceeds to 
process any subsequent records.  That way we don't get any dropped records or 
duplicate records.

I think one way of accomplishing this is to have the LineReader for multibyte 
custom delimiters report the current position as the end of the record data 
*without* the delimiter bytes.  Then any record that ends exactly at the end of 
the split or whose delimiter straddles the split boundary will cause the prior 
split to consume the extra record necessary.
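
A tiny sketch of that position-reporting rule (illustrative only, in plain 
Java rather than LineReader itself):

{code:title=PositionRule.java}
// Reports each record's position as the end of its data, *excluding* the
// delimiter bytes, and shows which records a split reader would consume
// under the rule described above.
public class PositionRule {
  public static void main(String[] args) {
    String data = "abcxxxdefxxx";  // records 'abc' and 'def', delimiter 'xxx'
    String delim = "xxx";
    long splitEnd = 4;             // split boundary falls mid-delimiter
    int pos = 0;
    while (pos < data.length()) {
      int d = data.indexOf(delim, pos);
      int recordEnd = (d < 0) ? data.length() : d;
      String record = data.substring(pos, recordEnd);
      // After 'abc' the reported position is 3, still before splitEnd=4, so
      // the first split's reader also consumes 'def' (the record whose
      // delimiter straddles the boundary).
      System.out.println("record='" + record + "' reportedPos=" + recordEnd
          + (recordEnd < splitEnd ? "  (within split, keep reading)"
                                  : "  (past split end, stop)"));
      pos = (d < 0) ? data.length() : d + delim.length();
    }
  }
}
{code}

The next split's reader discards everything up to its first full delimiter, so 
'def' is never emitted a second time.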



[jira] [Commented] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2013-08-13 Thread Kris Geusebroek (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738145#comment-13738145
 ] 

Kris Geusebroek commented on HADOOP-9867:
-

I created a fix by adding the following code (the '+' lines are the addition):

{code}
} else {
  if (start != 0) {
    skipFirstLine = true;
+   for (int i = 0; i < recordDelimiter.length; i++) {
+     --start;
+   }
    fileIn.seek(start);
  }
{code}

Currently I'm testing this with a custom created subclass of LineRecordReader. 
If testing is OK, I'm willing to create a patch file if needed.
