[ 
https://issues.apache.org/jira/browse/HADOOP-19091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821372#comment-17821372
 ] 

Venkatasubrahmanian Narayanan edited comment on HADOOP-19091 at 2/27/24 7:38 PM:
---------------------------------------------------------------------------------

[~srahman] I've uploaded my WIP Hive patch, along with a couple of other 
open-source patches that need to be backported to Hive 3.1. I still need to 
clean up a couple of things (which is why the patch hardcodes an expectation 
that tables are on S3), but the basic idea is to add an MRv1 wrapper around 
the MagicS3GuardCommitter, similar to how the FileOutputCommitter for MRv1 is 
implemented. Since Hive uses MRv1, it only requires incidental changes to 
treat paths the way the magic committer expects.

 

I was able to reproduce the behavior on EMR 6-12.0 with a simple Pig script 
that loads from CSV and stores into a table with HCatStorer. In the task and 
AM logs you can see the behavior I described: the path the task container 
writes the pending set to is subtly different from the path the AM tries to 
read it from (in my tests it differed by a single 0 appended after the first 
part of the jtIdentifier). The path is derived from the UUID, which in the 
default case is derived from the jobId. When I patch hadoop-aws to manually 
drop that extra digit from the jtIdentifier string, the data is successfully 
committed (which shows that no other factor is at play), but obviously that 
approach would not work in a real solution.
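
As a rough illustration of the mismatch described above, the sketch below 
builds a pending-set path twice, once from the identifier the task sees and 
once from the identifier the AM sees. Everything in it is an assumption made 
for illustration: the jtIdentifier values, the bucket, the task name, and the 
simplified directory layout are hypothetical and are not taken from the logs 
or from the hadoop-aws source.

```python
# Illustrative sketch only: the identifier values and the directory layout
# are simplified assumptions, not the real hadoop-aws implementation.


def job_id(jt_identifier: str, job_number: int) -> str:
    """Format an ID in the usual job_<jtIdentifier>_<NNNN> shape."""
    return f"job_{jt_identifier}_{job_number:04d}"


def pendingset_path(dest: str, job_uuid: str, task_id: str) -> str:
    """Hypothetical, simplified .pendingset location keyed on the job UUID."""
    return f"{dest}/__magic/job-{job_uuid}/{task_id}.pendingset"


dest = "s3a://example-bucket/warehouse/t1"  # hypothetical table location
task = "task_0001_m_000000"                 # hypothetical task name

# Hypothetical identifiers: the task side carries one extra "0".
am_uuid = job_id("1709060000000", 1)
task_uuid = job_id("17090600000000", 1)

written = pendingset_path(dest, task_uuid, task)    # where the task writes
looked_up = pendingset_path(dest, am_uuid, task)    # where the AM reads
paths_match = written == looked_up

print(written)
print(looked_up)
print(paths_match)
```

Because the UUID is embedded in the directory name, a one-character 
difference is enough for the AM to look in a directory the task never wrote 
to, which is consistent with the commit succeeding once the extra digit is 
dropped.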


> Add support for Tez to MagicS3GuardCommitter
> --------------------------------------------
>
>                 Key: HADOOP-19091
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19091
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 3.4.0, 3.3.6
>         Environment: Pig 17/Hive 3.1.3 with Hadoop 3.3.3 on AWS EMR 6-12.0
>            Reporter: Venkatasubrahmanian Narayanan
>            Priority: Major
>         Attachments: 0001-AWS-Hive-Changes.patch, 
> 0002-HIVE-27698-Backport-of-HIVE-22398-Remove-legacy-code.patch, 
> HADOOP-19091-HIVE-WIP.patch
>
>
> The MagicS3GuardCommitter assumes that the JobID of the task is the same as 
> that of the job's application master when writing/reading the .pendingset 
> file. This assumption is not valid when running with Tez, which creates 
> slightly different JobIDs for tasks and the application master.
>  
> While the MagicS3GuardCommitter is intended only for MRv2, it mostly works 
> fine with an MRv1 wrapper with Hive/Pig (with some minor changes to Hive) run 
> in MR mode. This issue only crops up when running queries with the Tez 
> execution engine. I can upload a patch to Hive 3.1 to reproduce this error on 
> EMR if needed.
>  
> Fixing this will probably require work from both Tez and Hadoop, so I wanted 
> to start a discussion here so we can figure out exactly how to go about this.



