[jira] [Commented] (PIG-4572) CSVExcelStorage treats newlines within fields as record seperator when input file is split

Shawn Weeks (JIRA) Mon, 17 Oct 2016 20:28:39 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-4572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584300#comment-15584300
 ]


Shawn Weeks commented on PIG-4572:
----------------------------------

I've loaded several large 10GB+ files with embedded newlines and had it work 
when split but I'm starting to think it was blind luck that it didn't split on 
one of the embedded newlines. I'm facing this issue with a file where every 
line has an embedded newline in the same column and as luck would have it every 
split is on the embedded newline instead of the row delimiter newline.

> CSVExcelStorage treats newlines within fields as record seperator when input 
> file is split
> ------------------------------------------------------------------------------------------
>
>                 Key: PIG-4572
>                 URL: https://issues.apache.org/jira/browse/PIG-4572
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.12.0, 0.14.0
>         Environment: Amazon ElasticMapReduce AMI 3.6.0
> Apache Pig version 0.14.0 and 0.12.0
> Hadoop 2.4.0
>            Reporter: Le Clue
>              Labels: CSVExcelStorage, pig
>             Fix For: 0.17.0
>
>         Attachments: SmallTest.txt, script.pig
>
>
> It seems that when a field enclosed by double-quotes contains a carriage 
> return or linefeed, and the input file is bigger than the dfs blocksize, the 
> input split does not honor CSVExcelStorage's treatment of newlines within 
> fields.
> It seems that the input is split by the linefeed closest to the byte range 
> defined for the split, and causes fields to become skewed.
> For example, 3190 Byte Text file containing 21 identical records such as the 
> below:
> "John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:
> This is the second line.
> Thank you for listening."~"2012-08-24 09:16:02"
> Each line termination here is specified by a CRLF
> Run through a pig script:
> SET mapred.min.split.size 1024;
> SET mapred.max.split.size 1024;
> SET pig.noSplitCombination true;
> SET mapred.max.jobs.per.node 1;
> myinput_file = LOAD 's3://sourcebucket/inputfile.txt' USING 
> org.apache.pig.piggybank.storage.CSVExcelStorage('~', 
> 'YES_MULTILINE','WINDOWS')
> AS(
>   name:chararray,
>   sysid:chararray,
>   message:chararray,
>   messagedate:chararray
> );
> myinput_tuples = FOREACH myinput_file GENERATE name;
> STORE myinput_tuples INTO '/output052/' USING 
> org.apache.pig.piggybank.storage.CSVExcelStorage(',');
> Results in 4 output files:
> -rw-r--r--   1 hadoop supergroup          0 2015-05-26 07:19 
> /output052/_SUCCESS
> -rw-r--r--   1 hadoop supergroup         63 2015-05-26 07:19 
> /output052/part-m-00000
> -rw-r--r--   1 hadoop supergroup         54 2015-05-26 07:19 
> /output052/part-m-00001
> -rw-r--r--   1 hadoop supergroup        769 2015-05-26 07:19 
> /output052/part-m-00002
> -rw-r--r--   1 hadoop supergroup         25 2015-05-26 07:19 
> /output052/part-m-00003
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00000
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00001
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00002
> This is the second line.
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00003
> This is the second line.
> Skewing occurs on the third part.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (PIG-4572) CSVExcelStorage treats newlines within fields as record seperator when input file is split

Reply via email to