[
https://issues.apache.org/jira/browse/TEZ-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated TEZ-2001:
----------------------------------
Attachment: TEZ-2001.1.patch
Similar to the approach listed in TEZ-1094, but specific to ordered usecases.
- As of now, pipelined shuffle is enabled only when PipelinedSorter is used.
PipelinedSorter uses multiple threads and churns out sorted files.
- It can be enabled by setting "tez.runtime.pipelined-shuffle.enabled=true".
- Spills are stored in
{code}${appDir}/output/${uniqueId}_${spillNumber}/file.out{code}. This layout
makes it easy for the existing ShuffleHandler to serve the output without
issues.
- Whenever a spill happens, a DME (DataMovementEvent) is sent out with the
spill id. If 3 spills are done, 3 events are sent out.
- On the consumer side, this data is collated before the fetcher threads
complete.
- maxTaskAttempts is set to 1 when pipelined shuffle is enabled. Additional
JIRAs need to be created to enhance error handling.
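The spill layout and the one-event-per-spill flow described above can be
sketched roughly as follows. This is an illustration only and is not taken
from the patch: SpillEvent, SpillCollator, eventsForSpills and the lastSpill
flag are hypothetical names, and 0-based spill ids plus a flag on the final
spill are assumptions about how a consumer could tell that it has everything.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class PipelinedShuffleSketch {

    /** Hypothetical stand-in for the per-spill event; not a Tez class. */
    public static final class SpillEvent {
        public final String uniqueId;
        public final int spillId;
        public final boolean lastSpill; // assumption: the final spill is flagged

        public SpillEvent(String uniqueId, int spillId, boolean lastSpill) {
            this.uniqueId = uniqueId;
            this.spillId = spillId;
            this.lastSpill = lastSpill;
        }
    }

    /** Builds ${appDir}/output/${uniqueId}_${spillNumber}/file.out. */
    public static String spillPath(String appDir, String uniqueId, int spillNumber) {
        return appDir + "/output/" + uniqueId + "_" + spillNumber + "/file.out";
    }

    /** Producer side: one event per spill, so 3 spills yield 3 events. */
    public static List<SpillEvent> eventsForSpills(String uniqueId, int spillCount) {
        List<SpillEvent> events = new ArrayList<>();
        for (int i = 0; i < spillCount; i++) {
            events.add(new SpillEvent(uniqueId, i, i == spillCount - 1));
        }
        return events;
    }

    /** Consumer side: collates spill events, which may arrive in any order. */
    public static final class SpillCollator {
        private final BitSet seen = new BitSet();
        private int lastSpillId = -1; // unknown until the flagged event arrives

        public void onEvent(SpillEvent e) {
            seen.set(e.spillId);
            if (e.lastSpill) {
                lastSpillId = e.spillId;
            }
        }

        /** True once the last spill is known and every spill up to it was seen. */
        public boolean isComplete() {
            return lastSpillId >= 0 && seen.cardinality() == lastSpillId + 1;
        }
    }
}
```

In this sketch a consumer would call onEvent for each incoming event and only
treat the producer's output as fully fetched once isComplete() returns true,
which mirrors the "collate before completing" step in the list above.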
Overall, this should be beneficial in cases where map-side spills are causing
the job runtime to suffer, since pipelining helps overlap network transfer
with CPU work.
Attaching the initial patch with this.
> Support pipelined data transfer for ordered output
> --------------------------------------------------
>
> Key: TEZ-2001
> URL: https://issues.apache.org/jira/browse/TEZ-2001
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2001.1.patch
>
>