[
https://issues.apache.org/jira/browse/HIVE-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194153#comment-14194153
]
Lefty Leverenz commented on HIVE-8394:
--------------------------------------
Thanks for saving me a bit of sleuthing, [~sushanth]! Maybe you'll start a
trend....
> HIVE-7803 doesn't handle Pig MultiQuery, can cause data-loss.
> -------------------------------------------------------------
>
> Key: HIVE-8394
> URL: https://issues.apache.org/jira/browse/HIVE-8394
> Project: Hive
> Issue Type: Bug
> Components: HCatalog
> Affects Versions: 0.12.0, 0.14.0, 0.13.1
> Reporter: Mithun Radhakrishnan
> Assignee: Mithun Radhakrishnan
> Priority: Critical
> Fix For: 0.14.0
>
> Attachments: HIVE-8394.1.patch, HIVE-8394.2.patch, HIVE-8394.3.patch,
> HIVE-8394.4.patch
>
>
> We've found situations in production where Pig queries using {{HCatStorer}},
> dynamic partitioning and {{opt.multiquery=true}} that produce partitions in
> the output table, but the corresponding directories have no data files (in
> spite of Pig reporting non-zero records written to HDFS). I don't yet have a
> distilled test-case for this.
> Here's the code from FileOutputCommitterContainer after HIVE-7803:
> {code:java|title=FileOutputCommitterContainer.java|borderStyle=dashed|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
> @Override
> public void commitTask(TaskAttemptContext context) throws IOException {
> String jobInfoStr =
> context.getConfiguration().get(FileRecordWriterContainer.DYN_JOBINFO);
> if (!dynamicPartitioningUsed) {
> //See HCATALOG-499
> FileOutputFormatContainer.setWorkOutputPath(context);
>
> getBaseOutputCommitter().commitTask(HCatMapRedUtil.createTaskAttemptContext(context));
> } else if (jobInfoStr != null) {
> ArrayList<String> jobInfoList =
> (ArrayList<String>)HCatUtil.deserialize(jobInfoStr);
> org.apache.hadoop.mapred.TaskAttemptContext currTaskContext =
> HCatMapRedUtil.createTaskAttemptContext(context);
> for (String jobStr : jobInfoList) {
> OutputJobInfo localJobInfo =
> (OutputJobInfo)HCatUtil.deserialize(jobStr);
> FileOutputCommitter committer = new FileOutputCommitter(new
> Path(localJobInfo.getLocation()), currTaskContext);
> committer.commitTask(currTaskContext);
> }
> }
> }
> {code}
> The serialized jobInfoList can't be retrieved, and hence the commit never
> completes. This is because Pig's MapReducePOStoreImpl deliberately clones
> both the TaskAttemptContext and the contained Configuration instance, thus
> separating the Configuration instances passed to
> {{FileOutputCommitterContainer::commitTask()}} and
> {{FileRecordWriterContainer::close()}}. Anything set by the RecordWriter is
> unavailable to the Committer.
> One approach would have been to store state in the FileOutputFormatContainer.
> But that won't work since this is constructed via reflection in
> HCatOutputFormat (itself constructed via reflection by PigOutputFormat via
> HCatStorer). There's no guarantee that the instance is preserved.
> My only recourse seems to be to use a Singleton to store shared state. I'm
> loath to indulge in this brand of shenanigans. (Statics and container-reuse
> in Tez might not play well together, for instance.) It might work if we're
> careful about tearing down the singleton.
> Any other ideas?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)