Jason Lowe created TEZ-3293:
-------------------------------

             Summary: Fetch failures can cause a shuffle hang waiting for memory merge that never starts
                 Key: TEZ-3293
                 URL: https://issues.apache.org/jira/browse/TEZ-3293
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.8.3, 0.7.1
            Reporter: Jason Lowe
            Assignee: Jason Lowe

Tez jobs can hang in shuffle waiting for a memory merge that never starts. When a MapOutput is reserved, it increments usedMemory, but when it is unreserved it decrements both usedMemory _and_ commitMemory. Because the reservation never added to commitMemory, each failed fetch leaves commitMemory lower than the number of bytes actually committed. If enough shuffle failures of sufficient size occur, commitMemory may never reach the merge threshold even after all outstanding transfers have committed, and the shuffle hangs.
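
The accounting mismatch described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions and is not the Tez source: the class ShuffleMemoryModel, its methods, the 90-byte threshold, and the 10-byte output size are all hypothetical. The variant that leaves commitMemory untouched on unreserve is included purely for contrast and is not claimed to be the actual fix.

```java
// Hypothetical model of the counters named in this report; not Tez code.
final class ShuffleMemoryModel {
    private final long mergeThreshold;     // assumed: merge starts once commitMemory reaches this
    private final boolean buggyUnreserve;  // true = unreserve also decrements commitMemory
    private long usedMemory;
    private long commitMemory;
    private boolean mergeStarted;

    ShuffleMemoryModel(long mergeThreshold, boolean buggyUnreserve) {
        this.mergeThreshold = mergeThreshold;
        this.buggyUnreserve = buggyUnreserve;
    }

    // Reserving space for a map output only touches usedMemory.
    void reserve(long size) {
        usedMemory += size;
    }

    // A successful fetch commits its bytes and may trigger the merge.
    void commit(long size) {
        commitMemory += size;
        if (commitMemory >= mergeThreshold) {
            mergeStarted = true;
        }
    }

    // A failed fetch gives its reservation back.
    void unreserve(long size) {
        usedMemory -= size;
        if (buggyUnreserve) {
            commitMemory -= size; // bytes that were never committed are subtracted
        }
    }

    long usedMemory() { return usedMemory; }
    long commitMemory() { return commitMemory; }
    boolean mergeStarted() { return mergeStarted; }

    // Three failed 10-byte fetches, then ten successful 10-byte fetches.
    static ShuffleMemoryModel simulate(boolean buggyUnreserve) {
        ShuffleMemoryModel m = new ShuffleMemoryModel(90, buggyUnreserve);
        for (int i = 0; i < 3; i++) {
            m.reserve(10);
            m.unreserve(10);
        }
        for (int i = 0; i < 10; i++) {
            m.reserve(10);
            m.commit(10);
        }
        return m;
    }

    public static void main(String[] args) {
        for (boolean buggy : new boolean[] {true, false}) {
            ShuffleMemoryModel m = simulate(buggy);
            System.out.println((buggy ? "buggy: " : "contrast: ")
                + "usedMemory=" + m.usedMemory()
                + " commitMemory=" + m.commitMemory()
                + " mergeStarted=" + m.mergeStarted());
        }
    }
}
```

In the buggy variant, the three failures push commitMemory to -30, so the 100 committed bytes only bring it to 70, below the assumed threshold of 90, while usedMemory sits at 100. The model omits everything else a real merge manager does (resetting counters after a merge, stalling fetchers at a memory limit) to keep the mismatch visible.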