[
https://issues.apache.org/jira/browse/TEZ-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661142#comment-14661142
]
Hitesh Shah edited comment on TEZ-2003 at 8/7/15 12:57 AM:
-----------------------------------------------------------
Spell check:
- logErrorIngored, hearbeats, getCurretnDagName
Fix imports:
- remove “*” e,g, import org.apache.tez.common.asterisk;
abortTask vs close/cleanup
- are both close and cleanup being invoked in both TezTaskRunner and Runner2
even when a task is aborted? ( need to re-check code )
TezTaskRunner2
{code}
* If a kill request is made to the task, it will not communicate this
information to
* the AM - since a task KILL is an external event, and whoever
invoked it should
* be able to track it.
{code}
- shouldn’t the AM be informed that the task has completed after the abort
related steps have been done?
{code}
private boolean isRunningState() {
return !taskComplete.get() &&
!killTaskRequested.get() && !stopContainerRequested.get() &&
!errorSeen.get();
}
{code}
- any possibility of simplifying this? Each get is atomic but the combined
AND expression is not atomic.
- any reason why the task runner needs to know that the container stop is
requested? For a task runner, it should only know about managing a task and not
the container. TezChild should handle the container stop handling instead?
{code}
// Don't throw an error since the task is already in the process of shutting
down.
LOG.info("returning canCommit=false since task is not in a running
state");
return false;
{code}
- why would a task call canCommit while shutting down? Shouldn’t we throw
an exception anyway as it is not meant to be called during shutdown?
TaskReporter
- some functions such as shutdown not synchronized?
tez-ext-service-tests/ShuffleHandler
- Not sure why we need a complex impl of a shuffle handler for a regression
test? If this is meant for a benchmark tool, maybe it should categorized as
such and become more maintstream?
tez-ext-service-tests
- if the ext-service layer is meant to be a full fledged plugin, it would be
good if this became the defacto guide on how to implement the necessary plugin
layers to talk to a simple 3rd party service. Current approach seems quite
complex.
JoinValidate:
- how does one actually use this example with configured external execution
engines? The change is added but no real docs or information on how it can
actually be run to demonstrate the feature.
TezTaskCommunicatorImpl
- Assuming this is the internal communicator used by Tez between the AM and
YARN containers, why does it needs a user payload and initialization from the
user payload?
TaskCommunicator
- has an initialize() but user payload seems to be parsed in the constructor
in TezTaskCommunicatorImpl
TaskCommunicatorContextImpl
{code}
public boolean isKnownContainer(ContainerId containerId) {
return
context.getAllContainers().get(containerId) != null;
}
{code}
- Shouldn’t each plugin manage its own containers? Or at least shouldn’t
this query be done based on which launcher plugin was being used for the given
container? Likewise for containerAlive().
APIs in TaskCommunicatorContextImpl probably some minor renaming - say
something like notifyTaskFailure() maybe?
getCurretnDagName() - dag can be null. Not handled. Likewise for
getInputVertexNames() and other apis where both dag, vertex ( even task ) could
be null. Needs more defensive programming against bad plugins.
{code}
public void onStateUpdated(VertexStateUpdate event)
{code}
- in this scenario, is the TaskCommunicator using the context to tell the AM
that a vertex has completed?
dagCompleteStart() - not sure what this API is meant to signify. It seems like
the context is being notified that the DAG has started?
Is there a need for the framework to make updates into the Context object? If
yes, should the Context implement 2 interfaces? Should the internal objects
just bind to the internal Impl objects or are they bound to the public plugin
interfaces to catch compat errors? Binding to Impls directly may mean a smaller
public API interface.
TaskAttemptListenerImpTezDag
ctor.setAccessible(true);
- why is this needed? Why not throw an error?
General design question:
- work preserving mode
- not for this jira but any general thoughts on how work preserving restarts
will work? For that matter, considering an LLAP service will not go down if the
AM crashes, how can work/tasks be recovered without needing to re-run them?
was (Author: hitesh):
Spell check:
- logErrorIngored, hearbeats, getCurretnDagName
Fix imports:
- remove “*” e,g, import org.apache.tez.common.*;
abortTask vs close/cleanup
- are both close and cleanup being invoked in both TezTaskRunner and Runner2
even when a task is aborted? ( need to re-check code )
TezTaskRunner2
{code}
* If a kill request is made to the task, it will not communicate this
information to
* the AM - since a task KILL is an external event, and whoever
invoked it should
* be able to track it.
{code}
- shouldn’t the AM be informed that the task has completed after the abort
related steps have been done?
{code}
private boolean isRunningState() {
return !taskComplete.get() &&
!killTaskRequested.get() && !stopContainerRequested.get() &&
!errorSeen.get();
}
{code}
- any possibility of simplifying this? Each get is atomic but the combined
AND expression is not atomic.
- any reason why the task runner needs to know that the container stop is
requested? For a task runner, it should only know about managing a task and not
the container. TezChild should handle the container stop handling instead?
{code}
// Don't throw an error since the task is already in the process of shutting
down.
LOG.info("returning canCommit=false since task is not in a running
state");
return false;
{code}
- why would a task call canCommit while shutting down? Shouldn’t we throw
an exception anyway as it is not meant to be called during shutdown?
TaskReporter
- some functions such as shutdown not synchronized?
tez-ext-service-tests/ShuffleHandler
- Not sure why we need a complex impl of a shuffle handler for a regression
test? If this is meant for a benchmark tool, maybe it should categorized as
such and become more maintstream?
tez-ext-service-tests
- if the ext-service layer is meant to be a full fledged plugin, it would be
good if this became the defacto guide on how to implement the necessary plugin
layers to talk to a simple 3rd party service. Current approach seems quite
complex.
JoinValidate:
- how does one actually use this example with configured external execution
engines? The change is added but no real docs or information on how it can
actually be run to demonstrate the feature.
TezTaskCommunicatorImpl
- Assuming this is the internal communicator used by Tez between the AM and
YARN containers, why does it needs a user payload and initialization from the
user payload?
TaskCommunicator
- has an initialize() but user payload seems to be parsed in the constructor
in TezTaskCommunicatorImpl
TaskCommunicatorContextImpl
{code}
public boolean isKnownContainer(ContainerId containerId) {
return
context.getAllContainers().get(containerId) != null;
}
{code}
- Shouldn’t each plugin manage its own containers? Or at least shouldn’t
this query be done based on which launcher plugin was being used for the given
container? Likewise for containerAlive().
APIs in TaskCommunicatorContextImpl probably some minor renaming - say
something like notifyTaskFailure() maybe?
getCurretnDagName() - dag can be null. Not handled. Likewise for
getInputVertexNames() and other apis where both dag, vertex ( even task ) could
be null. Needs more defensive programming against bad plugins.
{code}
public void onStateUpdated(VertexStateUpdate event)
{code}
- in this scenario, is the TaskCommunicator using the context to tell the AM
that a vertex has completed?
dagCompleteStart() - not sure what this API is meant to signify. It seems like
the context is being notified that the DAG has started?
Is there a need for the framework to make updates into the Context object? If
yes, should the Context implement 2 interfaces? Should the internal objects
just bind to the internal Impl objects or are they bound to the public plugin
interfaces to catch compat errors? Binding to Impls directly may mean a smaller
public API interface.
TaskAttemptListenerImpTezDag
ctor.setAccessible(true);
- why is this needed? Why not throw an error?
General design question:
- work preserving mode
- not for this jira but any general thoughts on how work preserving restarts
will work? For that matter, considering an LLAP service will not go down if the
AM crashes, how can work/tasks be recovered without needing to re-run them?
> [Umbrella] Allow Tez to co-ordinate execution to external services
> ------------------------------------------------------------------
>
> Key: TEZ-2003
> URL: https://issues.apache.org/jira/browse/TEZ-2003
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Attachments: 2003_20150728.1.txt, Tez With External Services.pdf
>
>
> The Tez engine itself takes care of co-ordinating execution - controlling how
> data gets routed (different connection patterns), fault tolerance, scheduling
> of work, etc.
> This is currently tied to TaskSpecs defined within Tez and on containers
> launched by Tez itself (TezChild).
> The proposal is to allow Tez to work with external services instead of just
> containers launched by Tez. This involves several more pluggable layers to
> work with alternate Task Specifications, custom launch and task allocation
> mechanics, as well as custom scheduling sources.
> A simple example would be a simple a process with the capability to execute
> multiple Tez TaskSpecs as threads. In such a case, a container launch isn't
> really need and can be mocked. Sourcing / scheduling containers would need to
> be pluggable.
> A more advanced example would be LLAP (HIVE-7926;
> https://issues.apache.org/jira/secure/attachment/12665704/LLAPdesigndocument.pdf).
> This works with custom interfaces - which would need to be supported by Tez,
> along with a custom event model which would need translation hooks.
> Tez should be able to work with a combination of certain vertices running in
> external services and others running in regular Tez containers.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)