[ https://issues.apache.org/jira/browse/HIVE-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mithun Radhakrishnan updated HIVE-8394:
---------------------------------------
    Status: Patch Available  (was: Open)

> HIVE-7803 doesn't handle Pig MultiQuery, can cause data-loss.
> -------------------------------------------------------------
>
>                 Key: HIVE-8394
>                 URL: https://issues.apache.org/jira/browse/HIVE-8394
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>    Affects Versions: 0.13.1, 0.12.0, 0.14.0
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>            Priority: Critical
>         Attachments: HIVE-8394.1.patch, HIVE-8394.2.patch
>
>
> We've found situations in production where Pig queries that use {{HCatStorer}}, 
> dynamic partitioning, and {{opt.multiquery=true}} produce partitions in the 
> output table, but the corresponding directories contain no data files (in spite 
> of Pig reporting a non-zero number of records written to HDFS). I don't yet 
> have a distilled test-case for this.
> Here's the code from FileOutputCommitterContainer after HIVE-7803:
> {code:java|title=FileOutputCommitterContainer.java|borderStyle=dashed|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
>   @Override
>   public void commitTask(TaskAttemptContext context) throws IOException {
>     String jobInfoStr =
>         context.getConfiguration().get(FileRecordWriterContainer.DYN_JOBINFO);
>     if (!dynamicPartitioningUsed) {
>       // See HCATALOG-499
>       FileOutputFormatContainer.setWorkOutputPath(context);
>       getBaseOutputCommitter().commitTask(
>           HCatMapRedUtil.createTaskAttemptContext(context));
>     } else if (jobInfoStr != null) {
>       ArrayList<String> jobInfoList =
>           (ArrayList<String>) HCatUtil.deserialize(jobInfoStr);
>       org.apache.hadoop.mapred.TaskAttemptContext currTaskContext =
>           HCatMapRedUtil.createTaskAttemptContext(context);
>       for (String jobStr : jobInfoList) {
>         OutputJobInfo localJobInfo = (OutputJobInfo) HCatUtil.deserialize(jobStr);
>         FileOutputCommitter committer =
>             new FileOutputCommitter(new Path(localJobInfo.getLocation()),
>                 currTaskContext);
>         committer.commitTask(currTaskContext);
>       }
>     }
>   }
> {code}
> The serialized jobInfoList can't be retrieved on the committer side, so the 
> dynamic-partition commit never happens. This is because Pig's 
> MapReducePOStoreImpl deliberately clones both the TaskAttemptContext and the 
> contained Configuration instance, thereby separating the Configuration passed 
> to {{FileOutputCommitterContainer::commitTask()}} from the one passed to 
> {{FileRecordWriterContainer::close()}}. Anything the RecordWriter sets on its 
> Configuration is invisible to the Committer.
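> To make the failure mode concrete, here's a minimal sketch (not HCatalog/Pig 
> source; the config key is a made-up stand-in for 
> {{FileRecordWriterContainer.DYN_JOBINFO}}, and the clone is modelled with 
> Configuration's copy constructor):
> {code:java|title=ClonedConfSketch.java (sketch)|borderStyle=dashed}
> import org.apache.hadoop.conf.Configuration;
>
> public class ClonedConfSketch {
>   public static void main(String[] args) {
>     // The Configuration the RecordWriter eventually writes into.
>     Configuration writerConf = new Configuration();
>
>     // MapReducePOStoreImpl takes its clone up front, before close() runs.
>     Configuration committerConf = new Configuration(writerConf);
>
>     // Stand-in for FileRecordWriterContainer.close() stashing the serialized
>     // OutputJobInfo list; the key name is made up for this sketch.
>     String key = "hypothetical.dyn.jobinfo";
>     writerConf.set(key, "serialized-jobinfo-list");
>
>     // The committer's copy never sees the value, so commitTask() finds null.
>     System.out.println("writer sees:    " + writerConf.get(key));    // non-null
>     System.out.println("committer sees: " + committerConf.get(key)); // null
>   }
> }
> {code}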
> One approach would have been to store state in the FileOutputFormatContainer, 
> but that won't work: the container is constructed via reflection in 
> HCatOutputFormat (itself constructed via reflection by PigOutputFormat, via 
> HCatStorer), so there's no guarantee that the same instance survives from the 
> write path to the commit path.
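> Just to spell that out (a toy sketch with a hypothetical class, not the actual 
> container): any field set on the write-side instance is simply gone once the 
> commit path reflectively constructs its own instance.
> {code:java|title=ReflectiveContainerSketch.java (sketch)|borderStyle=dashed}
> public class ReflectiveContainerSketch {
>   /** Hypothetical stand-in for a stateful FileOutputFormatContainer. */
>   static class StatefulContainer {
>     String pendingJobInfo; // state we'd like to carry from write to commit
>   }
>
>   public static void main(String[] args) throws Exception {
>     // The write path builds its container reflectively...
>     StatefulContainer writeSide =
>         StatefulContainer.class.getDeclaredConstructor().newInstance();
>     writeSide.pendingJobInfo = "serialized OutputJobInfo list";
>
>     // ...and the commit path does the same, getting a brand-new instance.
>     StatefulContainer commitSide =
>         StatefulContainer.class.getDeclaredConstructor().newInstance();
>
>     // Instance state never makes it across: prints "null".
>     System.out.println(commitSide.pendingJobInfo);
>   }
> }
> {code}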
> My only recourse seems to be to use a Singleton to store shared state. I'm 
> loath to indulge in this brand of shenanigans. (Statics and container-reuse 
> in Tez might not play well together, for instance.) It might work if we're 
> careful about tearing down the singleton.
> Any other ideas? 
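> For concreteness, here's roughly what I mean by the singleton (a sketch only, 
> with a hypothetical class name; not a patch): shared state keyed by 
> task-attempt ID, with an explicit teardown so a reused container/JVM doesn't 
> leak entries across attempts.
> {code:java|title=TaskCommitRegistry.java (sketch)|borderStyle=dashed}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> /** Hypothetical singleton for handing state from RecordWriter to Committer. */
> public final class TaskCommitRegistry {
>   private static final Map<String, String> JOB_INFO_BY_ATTEMPT =
>       new ConcurrentHashMap<>();
>
>   private TaskCommitRegistry() {}
>
>   // Write side (e.g. called from FileRecordWriterContainer.close()).
>   public static void put(String taskAttemptId, String serializedJobInfo) {
>     JOB_INFO_BY_ATTEMPT.put(taskAttemptId, serializedJobInfo);
>   }
>
>   // Commit side (e.g. called from FileOutputCommitterContainer.commitTask()).
>   public static String get(String taskAttemptId) {
>     return JOB_INFO_BY_ATTEMPT.get(taskAttemptId);
>   }
>
>   // Teardown in commitTask()/abortTask(), so the next attempt in a reused
>   // container (e.g. under Tez) starts with a clean slate.
>   public static void remove(String taskAttemptId) {
>     JOB_INFO_BY_ATTEMPT.remove(taskAttemptId);
>   }
> }
> {code}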



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
