[ 
https://issues.apache.org/jira/browse/TEZ-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219468#comment-16219468
 ] 

Siddharth Seth commented on TEZ-3363:
-------------------------------------

Couple of comments.
1. When the data for a vertex is deleted, I think it would be better to move 
the vertex into a different state, so that if a failure or re-run requires 
data from this vertex, its tasks can be re-run directly, instead of relying 
on failures from the source to trigger re-runs of upstream tasks (how 
slow/fast is this?). This can be problematic if the entire vertex ends up 
re-running even though a downstream task does not need all of its data. 
Ideally, tasks would be re-run only when a downstream consumer requests 
their data.
2. On the vertex events - does the vertex make sure that every downstream 
vertex at the specified depth is complete? It may be easier to move this 
coordination / selection of vertices for which data is to be deleted into the 
DAG, which already gets VERTEX_COMPLETE events.
3. The configs could be collapsed into one - with negative values indicating 
that the feature is disabled.
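
To make comment 2 concrete, here is a rough sketch of DAG-level selection, 
not the patch's code. The class, method and field names are all hypothetical 
and none of them are Tez APIs. It assumes the DAG knows each vertex's direct 
downstream vertices, that it is told once per completed vertex, that leaf 
vertices are left alone, and that a vertex with fewer than `depth` downstream 
levels becomes eligible once every level it does have is complete.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the DAG picks vertices whose intermediate data can go.
class DagDeletionTrackerSketch {
    private final Map<String, List<String>> downstream; // vertex -> direct consumers
    private final int depth;                            // levels that must be complete
    private final Set<String> completed = new HashSet<>();
    private final Set<String> selected = new HashSet<>();

    DagDeletionTrackerSketch(Map<String, List<String>> downstream, int depth) {
        if (depth < 1) {
            throw new IllegalArgumentException("depth must be >= 1");
        }
        this.downstream = downstream;
        this.depth = depth;
    }

    // Call once per vertex completion; returns vertices that just became eligible.
    List<String> onVertexComplete(String vertex) {
        completed.add(vertex);
        List<String> eligible = new ArrayList<>();
        for (String candidate : completed) {
            if (!selected.contains(candidate) && downstreamComplete(candidate)) {
                selected.add(candidate);
                eligible.add(candidate);
            }
        }
        Collections.sort(eligible); // stable order for callers and logs
        return eligible;
    }

    // True if every vertex within `depth` levels below `root` has completed.
    private boolean downstreamComplete(String root) {
        List<String> frontier = downstream.getOrDefault(root, List.of());
        if (frontier.isEmpty()) {
            return false; // assumption: leaf output is not treated as intermediate data
        }
        for (int level = 1; level <= depth && !frontier.isEmpty(); level++) {
            List<String> next = new ArrayList<>();
            for (String v : frontier) {
                if (!completed.contains(v)) {
                    return false;
                }
                next.addAll(downstream.getOrDefault(v, List.of()));
            }
            frontier = next;
        }
        return true;
    }
}
```

With a chain A -> B -> C and depth 1, completing B makes A eligible and 
completing C makes B eligible. The point is only that the check lives in one 
place that already sees every vertex completion.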

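For comment 3, a single setting might look roughly like the sketch below. 
The key name and helper methods are made up for illustration and are not 
actual Tez configuration names. A plain map stands in for the real 
configuration object, and treating every non-negative value as "enabled" is 
an assumption.

```java
import java.util.Map;

// Hypothetical sketch: one integer setting, where a negative value means "off".
class DeletionConfigSketch {
    // Made-up key for illustration only.
    static final String DELETION_HEIGHT_KEY = "example.vertex.deletion.height";
    static final int DISABLED = -1;

    // Returns the configured height, or DISABLED when the key is absent.
    static int getDeletionHeight(Map<String, String> conf) {
        String raw = conf.get(DELETION_HEIGHT_KEY);
        return raw == null ? DISABLED : Integer.parseInt(raw.trim());
    }

    // Any negative height switches the feature off.
    static boolean isDeletionEnabled(Map<String, String> conf) {
        return getDeletionHeight(conf) >= 0;
    }
}
```

This avoids a separate boolean "enabled" flag that can disagree with the 
numeric setting.
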

> Delete intermediate data at the vertex level for Shuffle Handler
> ----------------------------------------------------------------
>
>                 Key: TEZ-3363
>                 URL: https://issues.apache.org/jira/browse/TEZ-3363
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jonathan Eagles
>            Assignee: Kuhu Shukla
>         Attachments: TEZ-3363.001.patch, TEZ-3363.002.patch
>
>
> For applications like pig where processing times can be very long, 
> applications may choose to delete intermediate data for a sub dag. For 
> example if a DAG has synced data to HDFS, all upstream intermediate data can 
> be safely deleted.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
