[ https://issues.apache.org/jira/browse/PIG-4649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654055#comment-14654055 ]
Rohini Palaniswamy commented on PIG-4649:
-----------------------------------------
A UNION directly followed by HCatStorer retains only half of the data. This is
because HCatStorer does not honor mapreduce.output.basename and always uses
"part" as the file name prefix. Tez sets mapreduce.output.basename to
part-v000-o000 (vertex id followed by output id). With the union optimizer, Pig
uses vertex groups to write directly from both vertices to the final output
directory. Since HCatStorer ignores mapreduce.output.basename, both vertices
produce part-r-0000<n>, and when the files are moved from the temp location to
the final directory, they overwrite each other. No failure is reported, and
only one of the files with a given name makes it into the final directory.
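
To illustrate the overwrite described above, here is a minimal Python sketch
that simulates two vertices committing their output into one final directory.
It is not HCatStorer or Tez code. The function names, the "<basename>-r-00000"
file name layout, and the second vertex id (v001) are assumptions made for the
illustration.

```python
import os
import tempfile


def commit_outputs(temp_dirs, final_dir):
    """Move every file from each temp dir into final_dir.

    os.replace silently overwrites an existing file with the same name,
    which mimics the "no failure, one file wins" behavior in the comment.
    """
    for temp_dir in temp_dirs:
        for name in sorted(os.listdir(temp_dir)):
            os.replace(os.path.join(temp_dir, name),
                       os.path.join(final_dir, name))


def simulate(basenames):
    """Write one output file per vertex, commit them, and list the result.

    basenames holds the file name prefix each (hypothetical) vertex uses.
    """
    with tempfile.TemporaryDirectory() as root:
        final_dir = os.path.join(root, "final")
        os.mkdir(final_dir)
        temp_dirs = []
        for i, base in enumerate(basenames):
            temp_dir = os.path.join(root, f"tmp{i}")
            os.mkdir(temp_dir)
            # Assumed layout: "<basename>-r-00000" for the first task.
            with open(os.path.join(temp_dir, f"{base}-r-00000"), "w") as f:
                f.write(f"rows from vertex {i}\n")
            temp_dirs.append(temp_dir)
        commit_outputs(temp_dirs, final_dir)
        return sorted(os.listdir(final_dir))


if __name__ == "__main__":
    # Basename ignored: both vertices use "part", so one file is lost.
    print(simulate(["part", "part"]))
    # Basename honored: per-vertex prefixes keep both files.
    print(simulate(["part-v000-o000", "part-v001-o000"]))
```

In the first call both vertices write the same file name, so the final
directory ends up with a single file. In the second call each vertex has its
own prefix and both files survive the move.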
> [Pig on Tez] Union followed by HCatStorer misses some data
> ----------------------------------------------------------
>
> Key: PIG-4649
> URL: https://issues.apache.org/jira/browse/PIG-4649
> Project: Pig
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
>
> Script to reproduce:
> {code}
> A = LOAD 'data01.txt' USING PigStorage() as (id:chararray, message:chararray);
> B = LOAD 'data02.txt' USING PigStorage() as (id:chararray, message:chararray);
> C = UNION A, B;
> STORE C INTO 'db.table1' USING org.apache.hive.hcatalog.pig.HCatStorer();
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)