[
https://issues.apache.org/jira/browse/HIVE-7803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Damien Carol updated HIVE-7803:
-------------------------------
Description:
One of our users reports intermittent failures due to attempt
directories appearing in the input paths. We found that with speculative
execution turned on, two mappers tried to commit their task at the same time
using the same committed-task path, which corrupts the output directory.
The original Pig script:
{code}
STORE AdvertiserDataParsedClean INTO '$DB_NAME.$ADVERTISER_META_TABLE_NAME'
USING org.apache.hcatalog.pig.HCatStorer();
{code}
Two mapper attempts:
attempt_1405021984947_5394024_m_000523_0: KILLED
attempt_1405021984947_5394024_m_000523_1: SUCCEEDED
attempt_1405021984947_5394024_m_000523_0 was killed right after the commit.
As a result, a corrupt task directory was created at
/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/task_1405021984947_5394024_m_000523/
containing both
part-m-00523 (from attempt_1405021984947_5394024_m_000523_0)
and
attempt_1405021984947_5394024_m_000523_1/part-m-00523
Namenode Audit log
==========================
1. 2014-08-05 05:04:36,811 INFO FSNamesystem.audit: ugi=* ip=ipaddress1
cmd=create
src=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/_temporary/attempt_1405021984947_5394024_m_000523_0/part-m-00523
dst=null perm=user:group:rw-r-----
2. 2014-08-05 05:04:53,112 INFO FSNamesystem.audit: ugi=* ip=ipaddress2
cmd=create
src=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/_temporary/attempt_1405021984947_5394024_m_000523_1/part-m-00523
dst=null perm=user:group:rw-r-----
3. 2014-08-05 05:05:13,001 INFO FSNamesystem.audit: ugi=* ip=ipaddress1
cmd=rename
src=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/_temporary/attempt_1405021984947_5394024_m_000523_0
dst=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/task_1405021984947_5394024_m_000523
perm=user:group:rwxr-x---
4. 2014-08-05 05:05:13,004 INFO FSNamesystem.audit: ugi=* ip=ipaddress2
cmd=rename
src=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/_temporary/attempt_1405021984947_5394024_m_000523_1
dst=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/task_1405021984947_5394024_m_000523
perm=user:group:rwxr-x---
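The rename race in the audit log can be reproduced locally. HDFS rename, like POSIX {{mv}}, moves the source *into* the destination when the destination is an existing directory, which is exactly how the second attempt's directory ends up nested inside the committed task directory. A minimal sketch (attempt and task names are shortened placeholders, not the real paths):

{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CommitRaceDemo {
    // Emulates HDFS/POSIX rename semantics: if dst already exists as a
    // directory, the source is moved into it as a child (like `mv`).
    static void rename(Path src, Path dst) throws IOException {
        if (Files.isDirectory(dst)) {
            Files.move(src, dst.resolve(src.getFileName()));
        } else {
            Files.move(src, dst);
        }
    }

    static boolean[] run() throws IOException {
        Path root = Files.createTempDirectory("commit-race");
        Path tmp = Files.createDirectories(root.resolve("_temporary/1/_temporary"));
        Path taskDir = root.resolve("_temporary/1/task_m_000523");

        // Both speculative attempts write into their own attempt directory.
        Path a0 = Files.createDirectories(tmp.resolve("attempt_m_000523_0"));
        Path a1 = Files.createDirectories(tmp.resolve("attempt_m_000523_1"));
        Files.createFile(a0.resolve("part-m-00523"));
        Files.createFile(a1.resolve("part-m-00523"));

        // Attempt 0 commits first: its directory becomes the task directory.
        rename(a0, taskDir);
        // Attempt 1 commits too: the task directory already exists, so its
        // attempt directory is nested inside it -- the corruption seen in
        // the audit log.
        rename(a1, taskDir);

        return new boolean[] {
            Files.exists(taskDir.resolve("part-m-00523")),
            Files.exists(taskDir.resolve("attempt_m_000523_1/part-m-00523"))
        };
    }

    public static void main(String[] args) throws IOException {
        boolean[] r = run();
        System.out.println(r[0] + " " + r[1]);
    }
}
{code}

Both files end up under the task directory, so a downstream reader that expects only part files finds an unexpected attempt directory in its input path.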
After consulting our Hadoop core team, it was pointed out that some HCat code
does not participate in the two-phase commit protocol, for example in
FileRecordWriterContainer.close():
{code}
for (Map.Entry<String, org.apache.hadoop.mapred.OutputCommitter> entry :
        baseDynamicCommitters.entrySet()) {
    org.apache.hadoop.mapred.TaskAttemptContext currContext =
            dynamicContexts.get(entry.getKey());
    OutputCommitter baseOutputCommitter = entry.getValue();
    if (baseOutputCommitter.needsTaskCommit(currContext)) {
        baseOutputCommitter.commitTask(currContext);
    }
}
{code}
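The problem with the loop above is that it calls commitTask() unconditionally from close(), without first asking the framework for permission. In the regular MapReduce commit protocol, an attempt must get a go-ahead from the ApplicationMaster (via TaskUmbilicalProtocol.canCommit), which grants the commit to at most one attempt per task. A minimal sketch of that missing coordination step (hypothetical class and IDs, not the Hadoop API):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical coordinator sketch: grants commit permission to at most
// one attempt per task, which is what close() above never asks for.
public class CommitCoordinatorSketch {
    private final Map<String, String> winners = new ConcurrentHashMap<>();

    // Returns true only for the single attempt allowed to commit this task;
    // repeated calls by the winner stay true (the grant is idempotent).
    boolean canCommit(String taskId, String attemptId) {
        String winner = winners.putIfAbsent(taskId, attemptId);
        return winner == null || winner.equals(attemptId);
    }

    public static void main(String[] args) {
        CommitCoordinatorSketch am = new CommitCoordinatorSketch();
        // Two speculative attempts of the same task race to commit;
        // only the first one is granted permission.
        System.out.println(am.canCommit("task_m_000523", "attempt_m_000523_0"));
        System.out.println(am.canCommit("task_m_000523", "attempt_m_000523_1"));
    }
}
{code}

With such a gate in place, the losing speculative attempt would never reach commitTask(), so the task directory would only ever be renamed into place once.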
> Enabling Hadoop speculative execution may cause a corrupt output directory
> (dynamic partition)
> ------------------------------------------------------------------------------------------
>
> Key: HIVE-7803
> URL: https://issues.apache.org/jira/browse/HIVE-7803
> Project: Hive
> Issue Type: Bug
> Components: HCatalog
> Affects Versions: 0.13.1
> Environment:
> Reporter: Selina Zhang
> Assignee: Selina Zhang
> Priority: Critical
> Attachments: HIVE-7803.1.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)