[ 
https://issues.apache.org/jira/browse/PIG-3681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880482#comment-13880482
 ] 

Rohini Palaniswamy commented on PIG-3681:
-----------------------------------------

This is not a Pig problem. If you pass it as 
-Dtextinputformat.record.delimiter=\n on the command line, the shell passes it 
to Pig as two characters, \ and n, instead of a newline. If it is set directly 
in a Pig script, it is interpreted the same way, because "\n" is not something 
the Pig parser understands as a newline.

bq. An OutOfMemoryError is seen while using the 
textinputformat.record.delimiter property, not a NullPointerException.
   You encounter the OOM because Pig tries to read the whole file as a single 
record: the data never contains \ followed by n, so the delimiter is never found.
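
To illustrate the single-record behaviour, here is a minimal Python sketch. It 
is not Pig's or Hadoop's actual record reader; split_records is a hypothetical 
helper that simply splits on an exact delimiter string, once with the 
two-character string and once with a real newline.

```python
def split_records(data: str, delimiter: str) -> list:
    """Split data into records on an exact delimiter string (illustration only)."""
    return data.split(delimiter)


data = "line one\nline two\nline three\n"

literal_delimiter = "\\n"  # two characters: a backslash followed by 'n'
newline_delimiter = "\n"   # one character: a real newline

# The data never contains backslash + 'n', so nothing is split off
# and the whole input comes back as a single record.
records_literal = split_records(data, literal_delimiter)

# With a real newline the input is split line by line
# (the trailing newline yields one empty record at the end).
records_newline = split_records(data, newline_delimiter)

print(len(literal_delimiter), len(records_literal))
print(len(newline_delimiter), len(records_newline))
```

With a multi-gigabyte input, that single record is what would have to fit in 
memory.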

cat delimiter.properties 
textinputformat.record.delimiter=\n

pig -P delimiter.properties test.pig

This works and can be used as a workaround.
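
A likely reason the properties file behaves differently, assuming Pig loads the 
-P file with java.util.Properties.load (an assumption, not confirmed in this 
thread): that loader converts the escape sequence \n in a value into a real 
newline. A rough Python sketch of that unescaping follows. 
unescape_properties_value is a hypothetical helper, and it ignores \uXXXX 
escapes and line continuations.

```python
def unescape_properties_value(raw: str) -> str:
    """Rough sketch of Java-properties-style escape handling (illustration only)."""
    escapes = {"n": "\n", "t": "\t", "r": "\r", "f": "\f"}
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            # Known escapes map to control characters; for any other
            # character the backslash is dropped and the character kept.
            out.append(escapes.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


raw_value = "\\n"  # what the file contains after '=': backslash, then 'n'
delimiter = unescape_properties_value(raw_value)

print(len(raw_value), len(delimiter))
```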

Note:
 1) You will have a problem if the data is bzip2-compressed, as Pig has its own 
implementation, Bzip2TextInputFormat, that does not take that setting into 
account.
 2) The other problem you will have with textinputformat.record.delimiter=\n is
that you can only specify one delimiter. If a line ends with \r\n (CRLF), then
the \r will remain in the data.
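
The CRLF caveat in note 2 can be seen in a small Python sketch. This is plain 
string splitting for illustration, not Pig code, and the rstrip cleanup at the 
end is just one possible way to deal with the leftover character.

```python
crlf_data = "first\r\nsecond\r\n"

# Splitting on "\n" alone leaves the carriage return at the end of each record.
records = [r for r in crlf_data.split("\n") if r]

# One possible cleanup: strip the trailing "\r" from every record.
cleaned = [r.rstrip("\r") for r in records]

print(records)
print(cleaned)
```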

> NullpointerException while processing files in gzip format
> ----------------------------------------------------------
>
>                 Key: PIG-3681
>                 URL: https://issues.apache.org/jira/browse/PIG-3681
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11
>         Environment: Linux CentOS 6.4; CDH-4.x and CDH-5.0 (beta)
>            Reporter: Kumar Ravi
>
> When pig processes a large gzip file with text or mixed text and binary 
> content, it throws a NullPointerException if the property 
> textinputformat.record.delimiter is set to '\n'. This is because pig 
> interprets the specified delimiter as a two character string "\" followed by 
> "n" and not as a new line character.
>  If this property is not set, same file unzips without problems, but the diff 
> output of file unzipped using pig and unzipped using the gunzip command 
> differs.
>  Steps to recreate:
> 1. create a text file that is ~ 4GB - I concatenated some pig/hadoop stdout 
> and syslog files to create this file about 4GB in size.
> 2. compress it on unix command line - Ex. gzip abc
> 3. upload to hdfs (optional)
> 4. run the pig script included below to read/write the file.
> pig --param job_name="gunzip abc" --param inputfile="abc.gz" --param 
> outputdir=./test --param outputfile=abc gunzip.pig
>  
> Here are the contents of gunzip.pig:
> set job.name '$job_name' 
> set textinputformat.record.delimiter "\n"; 
> gzdata = LOAD '$inputfile' USING PigStorage(); 
> STORE gzdata INTO '$outputdir/$outputfile' USING PigStorage();
> This will cause the NullPointerException.
> If the second line (set textinputformat.record.delimiter field) is commented 
> out, the Exception won't occur but the output is not the same as the one 
> produced by gunzip.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
