Le Clue created PIG-4572:
----------------------------

             Summary: CSVExcelStorage treats newlines within fields as record 
separator when input file is split
                 Key: PIG-4572
                 URL: https://issues.apache.org/jira/browse/PIG-4572
             Project: Pig
          Issue Type: Bug
          Components: piggybank
    Affects Versions: 0.14.0, 0.12.0
         Environment: Amazon ElasticMapReduce AMI 3.6.0
Apache Pig version 0.14.0 and 0.12.0
Hadoop 2.4.0
            Reporter: Le Clue


It seems that when a field enclosed by double-quotes contains a carriage return 
or linefeed, and the input file is bigger than the dfs blocksize, the input 
split does not honor CSVExcelStorage's treatment of newlines within fields.

The input appears to be split at the linefeed closest to the byte range 
defined for the split, so a split can begin in the middle of a quoted field, 
which causes fields to become skewed.

For example, take a 3190-byte text file containing 21 identical records such 
as the one below:

"John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:
This is the second line.
Thank you for listening."~"2012-08-24 09:16:02"

Each line here is terminated by a CRLF.
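
To reproduce the input, a file with this layout can be generated by a short 
script. This is a minimal sketch, not part of the original report: the function 
names and the output path are placeholders, and it assumes there is no CRLF 
after the final record, which is the layout that yields exactly 3190 bytes for 
21 records.

```python
# One sample record; the quoted third field contains two CRLF line breaks.
RECORD = (
    b'"John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:\r\n'
    b'This is the second line.\r\n'
    b'Thank you for listening."~"2012-08-24 09:16:02"'
)


def build_input(n_records: int = 21) -> bytes:
    """Join identical records with CRLF, with no terminator after the last one (assumption)."""
    return b"\r\n".join([RECORD] * n_records)


def write_input(path: str = "inputfile.txt", n_records: int = 21) -> int:
    """Write the test file to a placeholder path and return its size in bytes."""
    data = build_input(n_records)
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


if __name__ == "__main__":
    print(write_input(), "bytes written")
```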

Run it through this Pig script:
SET mapred.min.split.size 1024;
SET mapred.max.split.size 1024;
SET pig.noSplitCombination true;
SET mapred.max.jobs.per.node 1;
myinput_file = LOAD 's3://sourcebucket/inputfile.txt' USING 
org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'YES_MULTILINE','WINDOWS')
AS(
  name:chararray,
  sysid:chararray,
  message:chararray,
  messagedate:chararray
);
myinput_tuples = FOREACH myinput_file GENERATE name;
STORE myinput_tuples INTO '/output052/' USING 
org.apache.pig.piggybank.storage.CSVExcelStorage(',');

This results in 4 output part files (plus _SUCCESS):

-rw-r--r--   1 hadoop supergroup          0 2015-05-26 07:19 /output052/_SUCCESS
-rw-r--r--   1 hadoop supergroup         63 2015-05-26 07:19 /output052/part-m-00000
-rw-r--r--   1 hadoop supergroup         54 2015-05-26 07:19 /output052/part-m-00001
-rw-r--r--   1 hadoop supergroup        769 2015-05-26 07:19 /output052/part-m-00002
-rw-r--r--   1 hadoop supergroup         25 2015-05-26 07:19 /output052/part-m-00003
[hadoop@ip-10-102-154-33 ~]$ hadoop fs -cat /output052/part-m-00000
John Doe
John Doe
John Doe
John Doe
John Doe
John Doe
John Doe
[hadoop@ip-10-102-154-33 ~]$ hadoop fs -cat /output052/part-m-00001
John Doe
John Doe
John Doe
John Doe
John Doe
John Doe
[hadoop@ip-10-102-154-33 ~]$ hadoop fs -cat /output052/part-m-00002
This is the second line.
"Thank you for listening.~2012-08-24 09:16:02""
John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
"Thank you for listening.~2012-08-24 09:16:02""
John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
"Thank you for listening.~2012-08-24 09:16:02""
John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
"Thank you for listening.~2012-08-24 09:16:02""
John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
"Thank you for listening.~2012-08-24 09:16:02""
John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
"Thank you for listening.~2012-08-24 09:16:02""
John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
[hadoop@ip-10-102-154-33 ~]$ hadoop fs -cat /output052/part-m-00003
This is the second line.

Skewing occurs in the third part (part-m-00002). The fourth part 
(part-m-00003) contains only a stray continuation line.
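
The stray "This is the second line." at the top of the third and fourth parts 
is consistent with a purely line-oriented split reader. The sketch below is a 
simplified model written for illustration, not the actual Hadoop or 
CSVExcelStorage code: it assumes the 1024-byte splits from the script, assumes 
the input has no CRLF after the last record, and assumes that a reader for a 
split with a non-zero start offset skips ahead to the first byte after the next 
linefeed. All function names are hypothetical.

```python
# One sample record; the quoted third field contains two CRLF line breaks.
RECORD = (
    b'"John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:\r\n'
    b'This is the second line.\r\n'
    b'Thank you for listening."~"2012-08-24 09:16:02"'
)


def build_input(n_records: int = 21) -> bytes:
    """21 identical records joined by CRLF (no trailing CRLF is an assumption)."""
    return b"\r\n".join([RECORD] * n_records)


def first_line_of_split(data: bytes, start: int) -> bytes:
    """Return the first physical line a line-oriented reader would see for a split.

    Simplified model: a split starting at offset 0 reads from the beginning;
    any other split discards everything up to and including the next linefeed.
    """
    if start > 0:
        newline = data.find(b"\n", start)
        if newline == -1:
            return b""
        start = newline + 1
    end = data.find(b"\n", start)
    if end == -1:
        end = len(data)
    return data[start:end].rstrip(b"\r")


if __name__ == "__main__":
    data = build_input()
    for split_start in range(0, len(data), 1024):
        print(split_start, first_line_of_split(data, split_start))
```

Under these assumptions, the splits starting at offsets 2048 and 3072 both 
begin on the second physical line of a record, that is, inside the quoted 
message field, while the split starting at 1024 happens to begin on a record 
boundary. That matches which output parts look clean and which do not.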



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
