[ 
https://issues.apache.org/jira/browse/PIG-4649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654055#comment-14654055
 ] 

Rohini Palaniswamy commented on PIG-4649:
-----------------------------------------

When a union is directly followed by HCatStorer, only half of the data is
retained. The cause is that HCatStorer does not honor
mapreduce.output.basename and always uses "part" as the output file prefix.
Tez sets mapreduce.output.basename to part-v000-o000 (vertex id followed by
output id). With the union optimizer, Pig uses vertex groups to write directly
from both vertices to the final output directory. Because HCatStorer ignores
mapreduce.output.basename, both vertices produce part-r-0000<n>, and when the
files are moved from the temp location to the final directory they overwrite
each other. Nothing fails; only one of the files with a given name makes it
into the final directory.
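
The overwrite described above can be illustrated with a small standalone
simulation. This is only a sketch and does not use Pig, Tez, or HCatalog code:
the function names, the directory layout, the vertex labels, and the second
basename part-v001-o000 are assumptions made for illustration. The -r-00000
suffix simply mirrors the part-r-0000<n> pattern mentioned above.

```python
import os
import tempfile


def write_vertex_output(temp_root, vertex, basename, records):
    """Write one vertex's records to <temp_root>/<vertex>/<basename>-r-00000."""
    vertex_dir = os.path.join(temp_root, vertex)
    os.makedirs(vertex_dir)
    path = os.path.join(vertex_dir, basename + "-r-00000")
    with open(path, "w") as f:
        f.write("\n".join(records) + "\n")
    return path


def commit(temp_paths, final_dir):
    """Move each temp file into final_dir; a same-named file is silently replaced."""
    for path in temp_paths:
        os.replace(path, os.path.join(final_dir, os.path.basename(path)))


def run(basenames):
    """Simulate two union vertices; return {file name: lines} found in the final dir."""
    with tempfile.TemporaryDirectory() as root:
        temp_root = os.path.join(root, "_temporary")
        final_dir = os.path.join(root, "final")
        os.makedirs(final_dir)
        # Hypothetical stand-ins for the two LOAD branches feeding the UNION.
        inputs = {
            "vertex_a": ["1\tfrom data01"],
            "vertex_b": ["2\tfrom data02"],
        }
        paths = [
            write_vertex_output(temp_root, vertex, basenames[vertex], records)
            for vertex, records in inputs.items()
        ]
        commit(paths, final_dir)
        result = {}
        for name in sorted(os.listdir(final_dir)):
            with open(os.path.join(final_dir, name)) as f:
                result[name] = f.read().splitlines()
        return result


if __name__ == "__main__":
    # Basename ignored: both vertices use "part", so one file replaces the other.
    print(run({"vertex_a": "part", "vertex_b": "part"}))
    # Basename honored: each vertex gets a distinct prefix, so both files survive.
    print(run({"vertex_a": "part-v000-o000", "vertex_b": "part-v001-o000"}))
```

In the first call the final directory ends up with a single file and no error
is raised, which matches the silent data loss reported here. In the second
call both files are kept because their names differ.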

> [Pig on Tez] Union followed by HCatStorer misses some data
> ----------------------------------------------------------
>
>                 Key: PIG-4649
>                 URL: https://issues.apache.org/jira/browse/PIG-4649
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.16.0
>
>
> Script to reproduce:
> {code}
> A = LOAD 'data01.txt' USING PigStorage() as (id:chararray, message:chararray);
> B = LOAD 'data02.txt' USING PigStorage() as (id:chararray, message:chararray);
> C = UNION A, B;
> STORE C INTO 'db.table1' USING org.apache.hive.hcatalog.pig.HCatStorer();
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
