Jason Lowe created TEZ-3293:
-------------------------------

             Summary: Fetch failures can cause a shuffle hang waiting for memory merge that never starts
                 Key: TEZ-3293
                 URL: https://issues.apache.org/jira/browse/TEZ-3293
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.8.3, 0.7.1
            Reporter: Jason Lowe
            Assignee: Jason Lowe


Tez jobs can hang in shuffle waiting for a memory merge that never starts.  
When a MapOutput is reserved, only usedMemory is incremented, but when it is 
unreserved, both usedMemory _and_ commitMemory are decremented.  A failed fetch 
therefore subtracts from commitMemory bytes that were never added to it.  If 
enough shuffle failures of sufficient size occur, commitMemory may never reach 
the merge threshold even after all outstanding transfers have committed, and 
the shuffle hangs.
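
The accounting mismatch can be illustrated with a small standalone model.  
This is only a sketch under stated assumptions, not the Tez merge manager code: 
the class ShuffleMemoryModel, its methods, the threshold of 100 and the 
60-byte outputs are all hypothetical, and only the names usedMemory and 
commitMemory come from the description above.  The "fixed" variant shows one 
possible correction (leaving commitMemory alone when a failed fetch is 
unreserved) and is not necessarily the change that resolves this issue.

```java
// Hypothetical, simplified model of the counters described in this issue.
// It is NOT the Tez implementation; it only mimics the described accounting.
class ShuffleMemoryModel {
    private long usedMemory = 0;    // bytes reserved for in-flight or held outputs
    private long commitMemory = 0;  // bytes of outputs that finished fetching
    private final long mergeThreshold;

    ShuffleMemoryModel(long mergeThreshold) {
        this.mergeThreshold = mergeThreshold;
    }

    // Reserving space for a MapOutput only touches usedMemory.
    void reserve(long size) {
        usedMemory += size;
    }

    // Behavior described in the report: a failed fetch gives back its
    // reservation but also subtracts from commitMemory, which never
    // included those bytes.
    void unreserveBuggy(long size) {
        usedMemory -= size;
        commitMemory -= size;
    }

    // Assumed correction for illustration: only undo what reserve() did.
    void unreserveFixed(long size) {
        usedMemory -= size;
    }

    // A successful fetch commits its bytes.
    void commit(long size) {
        commitMemory += size;
    }

    boolean mergeWouldStart() {
        return commitMemory >= mergeThreshold;
    }

    // Scenario: two 60-byte fetches, one fails and is retried successfully.
    // Returns whether the merge threshold (100) is reached at the end.
    static boolean simulate(boolean buggy) {
        ShuffleMemoryModel m = new ShuffleMemoryModel(100);
        m.reserve(60);              // fetch A starts
        m.reserve(60);              // fetch B starts
        if (buggy) {
            m.unreserveBuggy(60);   // fetch B fails
        } else {
            m.unreserveFixed(60);
        }
        m.commit(60);               // fetch A commits
        m.reserve(60);              // fetch B is retried
        m.commit(60);               // retry of B commits
        return m.mergeWouldStart();
    }

    public static void main(String[] args) {
        System.out.println("buggy accounting reaches threshold: " + simulate(true));
        System.out.println("fixed accounting reaches threshold: " + simulate(false));
    }
}
```

In this model both outputs end up committed (120 bytes in total), yet with the 
buggy accounting commitMemory ends at 60 and stays below the threshold, so 
nothing would ever trigger the merge.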



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
