[
https://issues.apache.org/jira/browse/TEZ-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219468#comment-16219468
]
Siddharth Seth commented on TEZ-3363:
-------------------------------------
Couple of comments.
1. When the data for a vertex is deleted, I think it would be better to move
the vertex into a different state, so that in case of failures / re-runs which
require data from this vertex, the vertex tasks can be re-run directly, instead
of relying on failures from the source to trigger re-runs of upstream tasks
(how slow/fast is this?). This can be problematic if the entire vertex ends up
re-running even though a downstream task does not need all of its data.
Ideally, tasks would be re-run only when a downstream consumer requests their
data.
2. On the vertex events: does the vertex make sure that every downstream
vertex at the specified depth is complete? It may be easier to move this
coordination (selecting the vertices whose data is to be deleted) into the
DAG, which already gets VERTEX_COMPLETE events.
3. The configs could be collapsed into one, with a negative value indicating
that the feature is disabled.
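To make points 2 and 3 concrete, here is a rough sketch of what DAG-level
selection driven by a single depth config could look like. This is not taken
from the patch: the class, its method names, and the string vertex names are
all hypothetical, and it assumes "depth d" means that every descendant within
d hops of a vertex must be complete before that vertex's data may be deleted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical DAG-level helper for illustration only; not a Tez class. */
final class VertexDataDeletionTracker {
    // Single config value (assumed semantics): negative means the feature is disabled.
    private final int deletionDepth;
    // Vertex name -> names of its direct downstream (consumer) vertices.
    private final Map<String, List<String>> downstream;
    // Sorted so that results come back in a deterministic order.
    private final Set<String> completed = new TreeSet<>();
    private final Set<String> deleted = new HashSet<>();

    VertexDataDeletionTracker(int deletionDepth, Map<String, List<String>> downstream) {
        this.deletionDepth = deletionDepth;
        this.downstream = downstream;
    }

    boolean isEnabled() {
        return deletionDepth >= 0;
    }

    /**
     * Called by the DAG when it sees a vertex complete. Returns the vertices
     * whose intermediate data has just become eligible for deletion.
     */
    List<String> onVertexComplete(String vertex) {
        completed.add(vertex);
        List<String> eligible = new ArrayList<>();
        if (!isEnabled()) {
            return eligible;
        }
        for (String candidate : completed) {
            if (!deleted.contains(candidate) && descendantsComplete(candidate)) {
                eligible.add(candidate);
            }
        }
        deleted.addAll(eligible);
        return eligible;
    }

    /** True if every descendant within deletionDepth hops has completed. */
    private boolean descendantsComplete(String vertex) {
        Set<String> frontier = Collections.singleton(vertex);
        for (int level = 0; level < deletionDepth; level++) {
            Set<String> next = new HashSet<>();
            for (String v : frontier) {
                for (String child : downstream.getOrDefault(v, Collections.emptyList())) {
                    if (!completed.contains(child)) {
                        return false;
                    }
                    next.add(child);
                }
            }
            frontier = next;
        }
        return true;
    }
}
```

With this shape the vertex itself does no downstream bookkeeping; the DAG calls
onVertexComplete for each completion it already observes and issues deletes for
whatever is returned. A vertex with no consumers becomes eligible as soon as it
completes, which may or may not be what is wanted for vertices writing to sinks.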
> Delete intermediate data at the vertex level for Shuffle Handler
> ----------------------------------------------------------------
>
> Key: TEZ-3363
> URL: https://issues.apache.org/jira/browse/TEZ-3363
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jonathan Eagles
> Assignee: Kuhu Shukla
> Attachments: TEZ-3363.001.patch, TEZ-3363.002.patch
>
>
> For applications like pig where processing times can be very long,
> applications may choose to delete intermediate data for a sub dag. For
> example if a DAG has synced data to HDFS, all upstream intermediate data can
> be safely deleted.