[ https://issues.apache.org/jira/browse/PIG-4572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15643503#comment-15643503 ]
Adam Szita commented on PIG-4572:
---------------------------------

Resolving this now - feel free to reopen if you don't find this conclusive.

> CSVExcelStorage treats newlines within fields as record separator when input
> file is split
> ------------------------------------------------------------------------------------------
>
>                 Key: PIG-4572
>                 URL: https://issues.apache.org/jira/browse/PIG-4572
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.12.0, 0.14.0
>         Environment: Amazon ElasticMapReduce AMI 3.6.0
>                      Apache Pig version 0.14.0 and 0.12.0
>                      Hadoop 2.4.0
>            Reporter: Le Clue
>            Assignee: Adam Szita
>              Labels: CSVExcelStorage, pig
>             Fix For: 0.17.0
>
>         Attachments: SmallTest.txt, script.pig
>
>
> It seems that when a field enclosed by double-quotes contains a carriage
> return or linefeed, and the input file is bigger than the dfs blocksize, the
> input split does not honor CSVExcelStorage's treatment of newlines within
> fields.
> It seems that the input is split at the linefeed closest to the byte range
> defined for the split, which causes fields to become skewed.
> For example, take a 3190-byte text file containing 21 identical records such
> as the one below:
> "John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:
> This is the second line.
> Thank you for listening."~"2012-08-24 09:16:02"
> Each line termination here is specified by a CRLF.
> Run through a pig script:
> SET mapred.min.split.size 1024;
> SET mapred.max.split.size 1024;
> SET pig.noSplitCombination true;
> SET mapred.max.jobs.per.node 1;
> myinput_file = LOAD 's3://sourcebucket/inputfile.txt' USING
> org.apache.pig.piggybank.storage.CSVExcelStorage('~',
> 'YES_MULTILINE','WINDOWS')
> AS(
> name:chararray,
> sysid:chararray,
> message:chararray,
> messagedate:chararray
> );
> myinput_tuples = FOREACH myinput_file GENERATE name;
> STORE myinput_tuples INTO '/output052/' USING
> org.apache.pig.piggybank.storage.CSVExcelStorage(',');
> Results in 4 output files:
> -rw-r--r-- 1 hadoop supergroup 0 2015-05-26 07:19 /output052/_SUCCESS
> -rw-r--r-- 1 hadoop supergroup 63 2015-05-26 07:19 /output052/part-m-00000
> -rw-r--r-- 1 hadoop supergroup 54 2015-05-26 07:19 /output052/part-m-00001
> -rw-r--r-- 1 hadoop supergroup 769 2015-05-26 07:19 /output052/part-m-00002
> -rw-r--r-- 1 hadoop supergroup 25 2015-05-26 07:19 /output052/part-m-00003
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00000
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00001
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00002
> This is the second line.
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00003
> This is the second line.
> Skewing occurs on the third part.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
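
The skew described in the quoted report can be reproduced in miniature. The sketch below is a simplified Python model, not the piggybank or Hadoop code: `read_split`, `RECORD`, and the 50-byte split size are hypothetical names and values chosen for illustration. It assumes a reader that, for any split not starting at byte 0, skips ahead to the first linefeed, and that then joins physical lines into one record until the double quotes balance.

```python
def read_split(data: bytes, start: int, end: int) -> list[bytes]:
    """Records emitted by a (hypothetical) reader assigned bytes [start, end)."""
    pos = start
    if start != 0:
        # Assumption: a line-oriented reader discards everything up to the
        # first linefeed, even if that linefeed sits inside a quoted field.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    records = []
    # A record is owned by this split if it begins at or before `end`.
    while pos <= end and pos < len(data):
        record = b""
        while pos < len(data):
            nl = data.find(b"\n", pos)
            stop = len(data) if nl == -1 else nl + 1
            record += data[pos:stop]
            pos = stop
            # Multiline handling: keep reading while a quoted field is open.
            if record.count(b'"') % 2 == 0:
                break
        records.append(record)
    return records


# Three identical two-field records, each with a CRLF inside the second field.
RECORD = b'"John Doe"~"first line\r\nsecond line"\r\n'
DATA = RECORD * 3

# Reading the file as one piece keeps every record intact.
whole = read_split(DATA, 0, len(DATA))

# Reading it in 50-byte splits lets a split begin in the middle of a record.
by_split = [
    read_split(DATA, s, min(s + 50, len(DATA)))
    for s in range(0, len(DATA), 50)
]

for i, recs in enumerate(by_split):
    print(i, recs)
```

In this model the second split begins on the tail of a quoted field, so its quote parity is inverted and it glues the end of one record onto the start of the next, which resembles the pattern shown for part-m-00002 in the report.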