[
https://issues.apache.org/jira/browse/TEZ-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296580#comment-14296580
]
Rajesh Balamohan commented on TEZ-2001:
---------------------------------------
As each incremental DME (one per spill file) arrives, fetchers are allowed to
start downloading the data. For example, assume the sorter is going to produce
4 spill segments in PipelinedSorter: when a segment is spilled, a DME event is
sent out and a fetcher starts downloading it. The last DME can also be
processed in parallel on the consumer side. However, the consumer ensures that
all previous spills pertaining to the attempt have been downloaded before
declaring success (i.e., all 4 DME events must have been processed before the
consumer declares that it has downloaded the data from the attempt). This lets
the data be downloaded in parallel while it is still being generated at the
source.
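The completion check described above can be sketched roughly as follows. This
is only an illustration, not the code in the patch: the class SpillTracker, its
method names, and the assumption that each DME carries a zero-based spill id
plus an "is last spill" flag are all hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical consumer-side tracker; not an actual Tez class. */
class SpillTracker {
    private static final class AttemptState {
        final BitSet downloaded = new BitSet(); // spill ids fetched so far
        int finalSpillId = -1;                  // unknown until the last DME arrives
    }

    private final Map<String, AttemptState> attempts = new HashMap<>();

    /**
     * Records that one spill of an attempt has been downloaded.
     * Assumes spill ids are zero-based and contiguous.
     * Returns true once every spill of the attempt has been downloaded.
     */
    synchronized boolean spillDownloaded(String attemptId, int spillId, boolean isLastSpill) {
        AttemptState s = attempts.computeIfAbsent(attemptId, k -> new AttemptState());
        s.downloaded.set(spillId);
        if (isLastSpill) {
            s.finalSpillId = spillId;
        }
        return isComplete(s);
    }

    synchronized boolean isComplete(String attemptId) {
        AttemptState s = attempts.get(attemptId);
        return s != null && isComplete(s);
    }

    // Complete only if the last spill is known and ids 0..finalSpillId are all present.
    private static boolean isComplete(AttemptState s) {
        return s.finalSpillId >= 0 && s.downloaded.nextClearBit(0) > s.finalSpillId;
    }
}
```

Because spills are tracked in a bit set rather than a counter, the last DME can
be processed before earlier ones without the attempt being declared complete
too early.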
Merging happens in parallel (in memory or on disk, depending on available
resources). When only part of an attempt's data has been downloaded, there is a
chance that this data gets merged and the source task then dies midway. In
subsequent jiras, we need to refactor the InMemory and Disk merges so that they
do not consider partially downloaded data and only consider attempts for which
all data has been downloaded.
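One way the proposed merge restriction might look is sketched below. It is not
taken from the patch: FetchedSegment, MergeCandidates, and the idea of passing
in a set of fully downloaded attempt ids are assumptions made for illustration
only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical record of one downloaded spill segment. */
class FetchedSegment {
    final String attemptId;
    final int spillId;

    FetchedSegment(String attemptId, int spillId) {
        this.attemptId = attemptId;
        this.spillId = spillId;
    }
}

/** Hypothetical helper that picks the segments a merger may safely consume. */
class MergeCandidates {
    /**
     * Returns only segments whose source attempt is fully downloaded.
     * Segments from partially downloaded attempts are left out, so a source
     * task failure cannot leave half of an attempt inside a merged output.
     */
    static List<FetchedSegment> eligible(List<FetchedSegment> fetched,
                                         Set<String> completeAttempts) {
        List<FetchedSegment> result = new ArrayList<>();
        for (FetchedSegment seg : fetched) {
            if (completeAttempts.contains(seg.attemptId)) {
                result.add(seg);
            }
        }
        return result;
    }
}
```

Segments that are filtered out stay where they were fetched until their attempt
completes, or are discarded if the attempt fails.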
> Support pipelined data transfer for ordered output
> --------------------------------------------------
>
> Key: TEZ-2001
> URL: https://issues.apache.org/jira/browse/TEZ-2001
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2001.1.patch, TEZ-2001.2.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)