Selina Zhang created HIVE-7803:
----------------------------------
Summary: Enable Hadoop speculative execution may cause corrupt
output directory (dynamic partition)
Key: HIVE-7803
URL: https://issues.apache.org/jira/browse/HIVE-7803
Project: Hive
Issue Type: Bug
Components: HCatalog
Affects Versions: 0.13.1
Environment:
Reporter: Selina Zhang
Assignee: Selina Zhang
Priority: Critical
One of our users reported intermittent failures caused by attempt
directories appearing in the input paths. We found that, with speculative
execution turned on, two attempts of the same map task tried to commit at the
same time using the same committed task path, which corrupted the output
directory.
The original Pig script:
(STORE AdvertiserDataParsedClean INTO '$DB_NAME.$ADVERTISER_META_TABLE_NAME'
USING org.apache.hcatalog.pig.HCatStorer();)
The two mapper attempts:
attempt_1405021984947_5394024_m_000523_0: KILLED
attempt_1405021984947_5394024_m_000523_1: SUCCEEDED
attempt_1405021984947_5394024_m_000523_0 was killed right after the commit.
As a result, it left a corrupt directory at
/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/task_1405021984947_5394024_m_000523/
containing
part-m-00523 (from attempt_1405021984947_5394024_m_000523_0)
and
attempt_1405021984947_5394024_m_000523_1/part-m-00523
Namenode Audit log
==========================
1. 2014-08-05 05:04:36,811 INFO FSNamesystem.audit: ugi=* ip=ipaddress1
cmd=create
src=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/_temporary/attempt_1405021984947_5394024_m_000523_0/part-m-00523
dst=null perm=user:group:rw-r-----
2. 2014-08-05 05:04:53,112 INFO FSNamesystem.audit: ugi=* ip=ipaddress2
cmd=create
src=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/_temporary/attempt_1405021984947_5394024_m_000523_1/part-m-00523
dst=null perm=user:group:rw-r-----
3. 2014-08-05 05:05:13,001 INFO FSNamesystem.audit: ugi=* ip=ipaddress1
cmd=rename
src=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/_temporary/attempt_1405021984947_5394024_m_000523_0
dst=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/task_1405021984947_5394024_m_000523
perm=user:group:rwxr-x---
4. 2014-08-05 05:05:13,004 INFO FSNamesystem.audit: ugi=* ip=ipaddress2
cmd=rename
src=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/_temporary/attempt_1405021984947_5394024_m_000523_1
dst=/projects/.../tablename/_DYN0.7192688458252056/load_time=201408050000/type=complete/_temporary/1/task_1405021984947_5394024_m_000523
perm=user:group:rwxr-x---
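The two renames in steps 3 and 4 target the same dst within 3 ms of each
other. The sketch below is a minimal in-memory model, not the HDFS API, of how
that can produce the nested layout reported above. It assumes that a rename
onto an already existing destination directory moves the source underneath
that directory. RenameModel, its methods, and the shortened /tmp and /task
paths are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical in-memory model of directory rename; not the HDFS API. */
public class RenameModel {
    // Every known path (directories and files) stored as a plain string.
    private final TreeSet<String> paths = new TreeSet<>();

    public void create(String path) {
        paths.add(path);
    }

    /** Moves src to dst; if dst already exists, src is moved underneath it (assumed semantics). */
    public void rename(String src, String dst) {
        String target = paths.contains(dst)
                ? dst + "/" + src.substring(src.lastIndexOf('/') + 1)
                : dst;
        // Collect src and everything below it before mutating the set.
        List<String> moved = new ArrayList<>();
        for (String p : paths) {
            if (p.equals(src) || p.startsWith(src + "/")) {
                moved.add(p);
            }
        }
        for (String p : moved) {
            paths.remove(p);
            paths.add(target + p.substring(src.length()));
        }
    }

    public TreeSet<String> list() {
        return paths;
    }

    public static void main(String[] args) {
        RenameModel fs = new RenameModel();
        fs.create("/tmp/attempt_0");
        fs.create("/tmp/attempt_0/part-m-00523");
        fs.create("/tmp/attempt_1");
        fs.create("/tmp/attempt_1/part-m-00523");
        // Both attempts rename their attempt directory to the same task path.
        fs.rename("/tmp/attempt_0", "/task");
        fs.rename("/tmp/attempt_1", "/task");
        fs.list().forEach(System.out::println);
    }
}
```

In this model the first rename creates /task/part-m-00523 and the second ends
up as /task/attempt_1/part-m-00523, which mirrors the two entries found in the
corrupt task directory.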
After consulting our Hadoop core team, we were told that some HCat code does
not participate in the two-phase commit protocol, for example in
FileRecordWriterContainer.close():
for (Map.Entry<String, org.apache.hadoop.mapred.OutputCommitter> entry
    : baseDynamicCommitters.entrySet()) {
  org.apache.hadoop.mapred.TaskAttemptContext currContext =
      dynamicContexts.get(entry.getKey());
  OutputCommitter baseOutputCommitter = entry.getValue();
  if (baseOutputCommitter.needsTaskCommit(currContext)) {
    baseOutputCommitter.commitTask(currContext);
  }
}
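For contrast, here is a minimal sketch of the gating step that a two-phase
commit adds: each attempt first asks a single coordinator for permission, and
only the attempt that is granted permission goes on to commit. This is an
illustration under stated assumptions, not Hadoop or HCatalog code and not a
proposed patch. CommitArbiter and canCommit are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical coordinator that lets exactly one attempt per task commit. */
public class CommitArbiter {
    // Attempt ID that was granted the right to commit, or null if none yet.
    private final AtomicReference<String> winner = new AtomicReference<>();

    /** Returns true for the first attempt that asks, and for that same attempt on retries. */
    public boolean canCommit(String attemptId) {
        if (winner.compareAndSet(null, attemptId)) {
            return true;
        }
        return attemptId.equals(winner.get());
    }

    public static void main(String[] args) {
        CommitArbiter arbiter = new CommitArbiter();
        String a0 = "attempt_1405021984947_5394024_m_000523_0";
        String a1 = "attempt_1405021984947_5394024_m_000523_1";
        // Only the attempt that is granted permission should call commitTask().
        System.out.println(a0 + " may commit: " + arbiter.canCommit(a0));
        System.out.println(a1 + " may commit: " + arbiter.canCommit(a1));
    }
}
```

With such a check in front of commitTask(), the second speculative attempt
would be refused and would not rename its attempt directory onto the shared
task path.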