[ 
https://issues.apache.org/jira/browse/TEZ-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326419#comment-14326419
 ] 

Gopal V commented on TEZ-2001:
------------------------------

bq. Are we looking to address the memory overhead in this jira, or in a follow 
up ?

The eventing model overheads are definitely going to be a follow-up.

That is a status-quo problem, which is usually engineered around by tuning splits until it is 1 spill per task. The added overheads are no different from spending a few hours of careful balancing to get an unskewed run today.

This isn't making it any worse - in fact, unskewed runs are the best-case scenario, which already works well today.

The crux of this JIRA is the generation of unmerged data and the downstream failure tolerance for it.

The important concern is the downstream tasks' tolerance - the upstream tasks completing successfully means "nothing has failed yet", without giving any additional information about the near future.

Production of all spills on a machine does not guarantee availability to all 
reducers - an NM crash or network partition will trigger the exact same 
re-execution codepath as a map task failure during pipelined shuffle.

In the case of deterministic spills (like terasort), we can keep the reducer state and pull the remainder off the new execution.

In the case of non-deterministic spills (hashmap aggregations), we are forced to drop all reducer state. This is because, to prevent a massive number of tiny files on the reducer disk, the disk spill on a reducer always goes through a merge; trying to hold unmerged in-memory inputs would eventually occupy all the memory, with the MergeManager stuck in WAIT.
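The two recovery branches above can be sketched as follows - a minimal sketch under the stated assumptions, where `planRefetch` and the class name are hypothetical illustrations, not actual Tez APIs:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of reducer-side recovery when an upstream task is re-executed
 * mid-shuffle. All names here are hypothetical, not actual Tez code.
 */
public class PipelinedShuffleRecovery {

    /**
     * Decide which spill indices of the new upstream attempt must be
     * fetched, given which spills were already fetched from the old one.
     */
    static List<Integer> planRefetch(boolean deterministicSpills,
                                     List<Integer> fetchedSpills,
                                     int totalSpills) {
        List<Integer> toFetch = new ArrayList<>();
        if (deterministicSpills) {
            // Deterministic output (e.g. terasort): spill k of the new
            // attempt is identical to spill k of the old one, so we keep
            // the reducer state and pull only the remainder.
            for (int i = 0; i < totalSpills; i++) {
                if (!fetchedSpills.contains(i)) {
                    toFetch.add(i);
                }
            }
        } else {
            // Non-deterministic output (e.g. hashmap aggregation): spill
            // boundaries differ between attempts, so all reducer state from
            // the old attempt is dropped and everything is refetched.
            fetchedSpills.clear();
            for (int i = 0; i < totalSpills; i++) {
                toFetch.add(i);
            }
        }
        return toFetch;
    }
}
```

In the deterministic branch the already-fetched spills stay usable; in the non-deterministic branch the cleared state mirrors the forced drop described above.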

These two scenarios are identical whether we fire 1 composite event (as bikas 
suggested) or individual events.

That is because these will result in un-coordinated fetches off the shuffle handler, due to the independence of the checksum streams of the IFiles of different spills.
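Since each spill is written as its own IFile with its own checksum stream, a fetcher cannot verify two spills as one combined stream - it must fetch and verify each spill independently, regardless of how many events announced them. A toy illustration, with CRC32 standing in for the IFile checksum (not the actual on-disk format):

```java
import java.util.zip.CRC32;

public class SpillChecksums {

    // Each spill's checksum is computed only over that spill's own bytes;
    // there is no running checksum spanning spills. Concatenating two
    // spills yields a stream whose checksum matches neither spill's,
    // so per-spill fetches cannot be coalesced into one verified fetch.
    static long spillChecksum(byte[] spillBytes) {
        CRC32 crc = new CRC32();
        crc.update(spillBytes);
        return crc.getValue();
    }
}
```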

Optimizations which improve over the status quo of manually tuned runs can come after testing this out on a failure-injected cluster, with a bad node always failing the nth spill's shuffles (retries, black-listing, etc.).

> Support pipelined data transfer for ordered output
> --------------------------------------------------
>
>                 Key: TEZ-2001
>                 URL: https://issues.apache.org/jira/browse/TEZ-2001
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2001.1.patch, TEZ-2001.2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
