abstractdog commented on a change in pull request #60:
URL: https://github.com/apache/tez/pull/60#discussion_r823705314
##########
File path: tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
##########
@@ -883,6 +883,25 @@ public TezConfiguration(boolean loadDefaults) {
+ "dag.cleanup.on.completion";
public static final boolean TEZ_AM_DAG_CLEANUP_ON_COMPLETION_DEFAULT = false;
+ /**
+ * Boolean value. Instructs AM to delete vertex shuffle data if a vertex and
all its
+ * child vertices at a certain depth are completed.
+ */
+ @ConfigurationScope(Scope.AM)
+ @ConfigurationProperty(type="boolean")
+ public static final String TEZ_AM_VERTEX_CLEANUP_ON_COMPLETION =
TEZ_AM_PREFIX
+ + "vertex.cleanup.on.completion";
+ public static final boolean TEZ_AM_VERTEX_CLEANUP_ON_COMPLETION_DEFAULT =
false;
+
+ /**
+ * Int value. The height from the vertex that it can issue shuffle data
deletion upon completion
+ */
+ @ConfigurationScope(Scope.AM)
+ @ConfigurationProperty(type="integer")
+ public static final String TEZ_AM_VERTEX_CLEANUP_HEIGHT = TEZ_AM_PREFIX
+ + "vertex.cleanup.height";
+ public static final int TEZ_AM_VERTEX_CLEANUP_HEIGHT_DEFAULT = 1;
Review comment:
thanks, makes sense a couple of comments:
1. please include here a javadoc comment with an example DAG and different
values of height to be very straightforward to anyone reading the code
2. please include different considerations about height = 1 vs. height = 2,
so user can get some pointers...I mean, yeah even a single node failure can
cause problems, let's say:
2a) Reduce4 attempts fail to fetch from Reduce3, notifies AM
2b) after a certain point, AM decides to rerun the most blamed Reducer3 task
2c) the rerun Reducer3 task cannot fetch directly from Reduce2 output
(because it was deleted when Reducer3 finished + height=1)
so, unfortunately, this is not necessarily true:
```
The probability of shuffle data of great ancestor(s) being requested is very
rare
```
so, after thinking over this again, for me, it seems like height=2 would be
better default value to survive even a single node failure (btw: I found that
the early patches by Kuhu defined 2 as default)
what do you think?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]