Le Clue created PIG-4572:
----------------------------
Summary: CSVExcelStorage treats newlines within fields as record
separator when input file is split
Key: PIG-4572
URL: https://issues.apache.org/jira/browse/PIG-4572
Project: Pig
Issue Type: Bug
Components: piggybank
Affects Versions: 0.14.0, 0.12.0
Environment: Amazon ElasticMapReduce AMI 3.6.0
Apache Pig version 0.14.0 and 0.12.0
Hadoop 2.4.0
Reporter: Le Clue
When a field enclosed in double quotes contains a carriage return or
linefeed, and the input file is larger than the DFS block size, the input
split does not appear to honor CSVExcelStorage's treatment of newlines
within fields. The input seems to be split at the linefeed closest to the
byte range defined for the split, which causes fields to become skewed.
For example, take a 3190-byte text file containing 21 identical records like
the one below:
"John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:
This is the second line.
Thank you for listening."~"2012-08-24 09:16:02"
Each line here is terminated by a CRLF.
Run through a Pig script:
SET mapred.min.split.size 1024;
SET mapred.max.split.size 1024;
SET pig.noSplitCombination true;
SET mapred.max.jobs.per.node 1;
myinput_file = LOAD 's3://sourcebucket/inputfile.txt' USING
org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'YES_MULTILINE','WINDOWS')
AS(
name:chararray,
sysid:chararray,
message:chararray,
messagedate:chararray
);
myinput_tuples = FOREACH myinput_file GENERATE name;
STORE myinput_tuples INTO '/output052/' USING
org.apache.pig.piggybank.storage.CSVExcelStorage(',');
Results in 4 output part files (plus _SUCCESS):
-rw-r--r-- 1 hadoop supergroup 0 2015-05-26 07:19 /output052/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 63 2015-05-26 07:19 /output052/part-m-00000
-rw-r--r-- 1 hadoop supergroup 54 2015-05-26 07:19 /output052/part-m-00001
-rw-r--r-- 1 hadoop supergroup 769 2015-05-26 07:19 /output052/part-m-00002
-rw-r--r-- 1 hadoop supergroup 25 2015-05-26 07:19 /output052/part-m-00003
[hadoop@ip-10-102-154-33 ~]$ hadoop fs -cat /output052/part-m-00000
John Doe
John Doe
John Doe
John Doe
John Doe
John Doe
John Doe
[hadoop@ip-10-102-154-33 ~]$ hadoop fs -cat /output052/part-m-00001
John Doe
John Doe
John Doe
John Doe
John Doe
John Doe
[hadoop@ip-10-102-154-33 ~]$ hadoop fs -cat /output052/part-m-00002
This is the second line.
"Thank you for listening.~2012-08-24 09:16:02""
John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
"Thank you for listening.~2012-08-24 09:16:02""
John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
"Thank you for listening.~2012-08-24 09:16:02""
John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
"Thank you for listening.~2012-08-24 09:16:02""
John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
"Thank you for listening.~2012-08-24 09:16:02""
John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
"Thank you for listening.~2012-08-24 09:16:02""
John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
[hadoop@ip-10-102-154-33 ~]$ hadoop fs -cat /output052/part-m-00003
This is the second line.
Skewing occurs starting with the third part file (part-m-00002).
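The suspected mechanism can be checked with a small model outside of Pig. The Python sketch below is not the Hadoop or piggybank code; it assumes a simplified line-oriented reader in which every split that does not begin at byte 0 discards bytes up to and including the next LF. The helper names (line_aligned_start, inside_quoted_field, split_report) are hypothetical, and the sketch further assumes the last record has no trailing CRLF, which is what makes 21 records come to 3190 bytes.

```python
# One record from the report: three physical lines, each ending in CRLF.
RECORD = (
    b'"John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:\r\n'
    b'This is the second line.\r\n'
    b'Thank you for listening."~"2012-08-24 09:16:02"\r\n'
)

# Assumption: the final record has no trailing CRLF, so that 21 records
# total exactly the 3190 bytes quoted in the report.
DATA = (RECORD * 21)[:-2]


def line_aligned_start(data: bytes, offset: int) -> int:
    """Simplified model of a line-oriented reader: a split that does not
    begin at byte 0 discards bytes up to and including the next LF."""
    if offset == 0:
        return 0
    lf = data.find(b"\n", offset)
    return len(data) if lf == -1 else lf + 1


def inside_quoted_field(data: bytes, pos: int) -> bool:
    """True when an odd number of double quotes precedes pos, i.e. the
    position falls inside a quoted field that is still open."""
    return data.count(b'"', 0, pos) % 2 == 1


def split_report(data: bytes, split_size: int):
    """Return (nominal offset, first byte read, starts mid-field?) per split."""
    rows = []
    for offset in range(0, len(data), split_size):
        start = line_aligned_start(data, offset)
        rows.append((offset, start, inside_quoted_field(data, start)))
    return rows


if __name__ == "__main__":
    # 1024 matches mapred.min.split.size / mapred.max.split.size in the script.
    for offset, start, broken in split_report(DATA, 1024):
        print(offset, start, "mid-field" if broken else "record boundary")
```

Under these assumptions the first two splits begin on record boundaries, while the third and fourth begin at the line "This is the second line.", inside an open quoted field. That is consistent with part-m-00002 and part-m-00003 both starting with that line in the output above.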