[ 
https://issues.apache.org/jira/browse/PIG-4572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592024#comment-15592024
 ] 

Adam Szita commented on PIG-4572:
---------------------------------

Hi, I've taken a deep look into this. Beware, long story ahead (TL;DR at bottom)

The problem is rooted in the way Hadoop reads text files line by line 
and how it creates splits from them.
It doesn't matter that CSVExcelStorage knows which field (~) and record 
(\r\n) delimiters and embedded line breaks are used in the data; Hadoop has 
no notion of CSV records or embedded line breaks when it comes to 
reading the text file into splits.

Unless told otherwise (and by default it isn't), Hadoop assumes a normal line 
ending is the record delimiter, and uses the readDefaultLine method here: 
https://github.com/apache/hadoop/blob/release-2.4.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L169
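To illustrate the difference, here is a minimal sketch in plain Python (not Hadoop code, and not the actual LineReader logic) of what readDefaultLine versus readCustomLine means for records with embedded line breaks; the sample data below is made up:

```python
import re

# Sample data: embedded line breaks are bare \n, real record endings
# are \r\n (the CSVExcelStorage 'WINDOWS' convention).
data = '"John Doe"~"line one\nline two"\r\n"Jane Doe"~"single line"\r\n'

# readDefaultLine behaviour: any \n (or \r\n) ends a record, so the
# embedded break turns one logical record into two bogus ones.
default_records = [r for r in re.split(r'\r\n|\n', data) if r]
print(len(default_records))  # 3 "records" instead of 2

# readCustomLine with textinputformat.record.delimiter set to \r\n:
# only the two-byte sequence ends a record, so the embedded \n survives.
custom_records = [r for r in data.split('\r\n') if r]
print(len(custom_records))   # 2 records, embedded line break intact
```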

In our case we want to set the property textinputformat.record.delimiter to 
something like "\r\n" so that readCustomLine is used and splitting is done 
correctly. Setting this isn't easy in Pig, for reasons described here: 
http://aaron.blog.archive.org/2013/05/27/customizing-pig-for-sort-order-and-line-termination
The easiest way I found is to put it in a property file and supply that to 
Pig at startup with the -P option.

Also, once we have it set to "\r\n", we'll see that although this separates 
the records perfectly, it also strips the quote character (") from record 
beginnings and endings, which CSVExcelStorage heavily depends on.

So what I came up with is to set it to *"\r\n* instead: this keeps the " char 
intact at the record beginning, and only mangles the one at the record ending.
That, however, is not a problem if we specify CSVExcelStorage('~', 
'*NO_MULTILINE*','WINDOWS'): the fact that the record's buffer contains no 
further characters and NO_MULTILINE is set causes CSVExcelStorage to save 
the current buffer without needing the missing closing ". Yes, this is hacky 
in a way; we can think of it as the multiline handling being done by Hadoop 
already instead.
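The effect of the *"\r\n* delimiter can be sketched in plain Python as well (again, not the actual Hadoop split logic; the sample data is made up):

```python
# Splitting on the three-byte delimiter "\r\n (quote + CRLF): the
# delimiter consumes the closing quote of each record, but the opening
# quote of the next record stays intact.
data = '"John Doe"~"msg one"\r\n"Jane Doe"~"msg two"\r\n'

records = [r for r in data.split('"\r\n') if r]
# Each record still starts with ", only the trailing " is lost --
# exactly the situation that NO_MULTILINE tolerates.
print(records)
```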

Summarized, try this:
- create a property file with the following content and give it to Pig with 
the -P option:
{code:title=myprops.properties|borderStyle=solid}
textinputformat.record.delimiter="\r\n
{code}
- use the NO_MULTILINE option in CSVExcelStorage instead:
CSVExcelStorage('~', 'NO_MULTILINE','WINDOWS') 





> CSVExcelStorage treats newlines within fields as record separator when input 
> file is split
> ------------------------------------------------------------------------------------------
>
>                 Key: PIG-4572
>                 URL: https://issues.apache.org/jira/browse/PIG-4572
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.12.0, 0.14.0
>         Environment: Amazon ElasticMapReduce AMI 3.6.0
> Apache Pig version 0.14.0 and 0.12.0
> Hadoop 2.4.0
>            Reporter: Le Clue
>            Assignee: Adam Szita
>              Labels: CSVExcelStorage, pig
>             Fix For: 0.17.0
>
>         Attachments: SmallTest.txt, script.pig
>
>
> It seems that when a field enclosed by double-quotes contains a carriage 
> return or linefeed, and the input file is bigger than the dfs blocksize, the 
> input split does not honor CSVExcelStorage's treatment of newlines within 
> fields.
> It seems that the input is split by the linefeed closest to the byte range 
> defined for the split, and causes fields to become skewed.
> For example, 3190 Byte Text file containing 21 identical records such as the 
> below:
> "John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:
> This is the second line.
> Thank you for listening."~"2012-08-24 09:16:02"
> Each line termination here is specified by a CRLF
> Run through a pig script:
> SET mapred.min.split.size 1024;
> SET mapred.max.split.size 1024;
> SET pig.noSplitCombination true;
> SET mapred.max.jobs.per.node 1;
> myinput_file = LOAD 's3://sourcebucket/inputfile.txt' USING 
> org.apache.pig.piggybank.storage.CSVExcelStorage('~', 
> 'YES_MULTILINE','WINDOWS')
> AS(
>   name:chararray,
>   sysid:chararray,
>   message:chararray,
>   messagedate:chararray
> );
> myinput_tuples = FOREACH myinput_file GENERATE name;
> STORE myinput_tuples INTO '/output052/' USING 
> org.apache.pig.piggybank.storage.CSVExcelStorage(',');
> Results in 4 output files:
> -rw-r--r--   1 hadoop supergroup          0 2015-05-26 07:19 
> /output052/_SUCCESS
> -rw-r--r--   1 hadoop supergroup         63 2015-05-26 07:19 
> /output052/part-m-00000
> -rw-r--r--   1 hadoop supergroup         54 2015-05-26 07:19 
> /output052/part-m-00001
> -rw-r--r--   1 hadoop supergroup        769 2015-05-26 07:19 
> /output052/part-m-00002
> -rw-r--r--   1 hadoop supergroup         25 2015-05-26 07:19 
> /output052/part-m-00003
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00000
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00001
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00002
> This is the second line.
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00003
> This is the second line.
> Skewing occurs on the third part.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
