[
https://issues.apache.org/jira/browse/TEZ-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326251#comment-14326251
]
Siddharth Seth commented on TEZ-2001:
-------------------------------------
TEZ-1094 suggests an approach where transient events are used for intermediate
spills. However, that loses empty partition information as well. There's a
trade-off between the memory consumed on the AM plus RPC event transfer and the
benefit of tracking empty partitions.
For pipelined transfer to work, events need to reach destination tasks as
they're generated, instead of all at once after all data has been generated.
One approach is to send all empty partition information with each event - and
discard all previous events.
Another is to completely drop empty partition support.
One more, suggested by [~gopalv] while discussing TEZ-1094, was to aggregate the
empty partitions across events. This is a balance between maintaining all empty
partition information and completely dropping empty partition support.
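
As a rough sketch of what aggregating empty partitions across events could
look like (one possible reading of the suggestion, not the Tez implementation):
each spill event is assumed to carry a bitmap of its empty partitions, and a
partition counts as empty overall only if it was empty in every spill seen so
far. The class name EmptyPartitionAggregator and its methods are hypothetical.

```java
import java.util.BitSet;

/** Hypothetical helper, not part of Tez: merges per-spill empty-partition bitmaps. */
final class EmptyPartitionAggregator {
    // Bit i is set while partition i has been empty in every spill seen so far.
    private final BitSet emptyInAll;
    private int spillsSeen = 0;

    EmptyPartitionAggregator(int numPartitions) {
        this.emptyInAll = new BitSet(numPartitions);
        // Before any spill arrives, no partition is known to hold data.
        this.emptyInAll.set(0, numPartitions);
    }

    /** Folds in one spill's bitmap (bit i set means partition i was empty in that spill). */
    void addSpill(BitSet emptyInThisSpill) {
        // A partition stays "empty" only if this spill also left it empty.
        emptyInAll.and(emptyInThisSpill);
        spillsSeen++;
    }

    /** Returns a copy of the aggregated bitmap so callers cannot mutate internal state. */
    BitSet emptyInAllSpills() {
        return (BitSet) emptyInAll.clone();
    }

    int spillsSeen() {
        return spillsSeen;
    }
}
```

With this shape only a single bitmap per output is retained rather than one per
event. The cost is that a consumer can no longer tell which individual spill
was empty for a given partition, which is the balance described above.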
In any of these scenarios, event aggregation will be required within the AM,
unless we can get to a point where event storage is extremely memory efficient.
There's still the overhead of sending each event over RPC, as opposed to
potentially sending aggregated events.
Are we looking to address the memory overhead in this jira, or in a follow-up?
> Support pipelined data transfer for ordered output
> --------------------------------------------------
>
> Key: TEZ-2001
> URL: https://issues.apache.org/jira/browse/TEZ-2001
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2001.1.patch, TEZ-2001.2.patch
>
>