[jira] [Comment Edited] (TEZ-2003) [Umbrella] Allow Tez to co-ordinate execution to external services

Hitesh Shah (JIRA) Thu, 06 Aug 2015 17:59:12 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661142#comment-14661142
 ]


Hitesh Shah edited comment on TEZ-2003 at 8/7/15 12:57 AM:
-----------------------------------------------------------

Spell check:
  - logErrorIngored, hearbeats, getCurretnDagName

Fix imports:
   - remove “*” e,g, import org.apache.tez.common.asterisk; 

abortTask vs close/cleanup
   - are both close and cleanup being invoked in both TezTaskRunner and Runner2 
even when a task is aborted? ( need to re-check code )

TezTaskRunner2

{code}
* If a kill request is made to the task, it will not communicate this 
information to * the AM - since a task KILL is an external event, and whoever 
invoked it should * be able to track it.
{code}
   - shouldn’t the AM be informed that the task has completed after the abort 
related steps have been done? 


{code}
private boolean isRunningState() {   return !taskComplete.get() && 
!killTaskRequested.get() && !stopContainerRequested.get() &&       
!errorSeen.get(); }
{code}
    - any possibility of simplifying this? Each get is atomic but the combined 
AND expression is not atomic. 
    - any reason why the task runner needs to know that the container stop is 
requested? For a task runner, it should only know about managing a task and not 
the container. TezChild should handle the container stop handling instead?

{code}
// Don't throw an error since the task is already in the process of shutting 
down. LOG.info("returning canCommit=false since task is not in a running 
state"); return false;
{code}
    - why would a task call canCommit while shutting down? Shouldn’t we throw 
an exception anyway as it is not meant to be called during shutdown?

TaskReporter
   - some functions such as shutdown not synchronized?

tez-ext-service-tests/ShuffleHandler
   - Not sure why we need a complex impl of a shuffle handler for a regression 
test? If this is meant for a benchmark tool, maybe it should categorized as 
such and become more maintstream?

tez-ext-service-tests
   - if the ext-service layer is meant to be a full fledged plugin, it would be 
good if this became the defacto guide on how to implement the necessary plugin 
layers to talk to a simple 3rd party service. Current approach seems quite 
complex. 

JoinValidate:
   - how does one actually use this example with configured external execution 
engines? The change is added but no real docs or information on how it can 
actually be run to demonstrate the feature. 

TezTaskCommunicatorImpl
   - Assuming this is the internal communicator used by Tez between the AM and 
YARN containers, why does it needs a user payload and initialization from the 
user payload? 

TaskCommunicator
   - has an initialize() but user payload seems to be parsed in the constructor 
in TezTaskCommunicatorImpl

TaskCommunicatorContextImpl

{code}
public boolean isKnownContainer(ContainerId containerId) {   return 
context.getAllContainers().get(containerId) != null; }
{code}
   - Shouldn’t each plugin manage its own containers? Or at least shouldn’t 
this query be done based on which launcher plugin was being used for the given 
container? Likewise for containerAlive().

APIs in TaskCommunicatorContextImpl probably some minor renaming - say 
something like notifyTaskFailure() maybe?

getCurretnDagName() - dag can be null. Not handled. Likewise for 
getInputVertexNames() and other apis where both dag, vertex ( even task ) could 
be null. Needs more defensive programming against bad plugins.

{code}
public void onStateUpdated(VertexStateUpdate event)
{code}
  - in this scenario, is the TaskCommunicator using the context to tell the AM 
that a vertex has completed?

dagCompleteStart() - not sure what this API is meant to signify. It seems like 
the context is being notified that the DAG has started?

Is there a need for the framework to make updates into the Context object? If 
yes, should the Context implement 2 interfaces? Should the internal objects 
just bind to the internal Impl objects or are they bound to the public plugin 
interfaces to catch compat errors? Binding to Impls directly may mean a smaller 
public API interface.

TaskAttemptListenerImpTezDag

ctor.setAccessible(true); 
   - why is this needed? Why not throw an error? 

General design question:
   - work preserving mode
   - not for this jira but any general thoughts on how work preserving restarts 
will work? For that matter, considering an LLAP service will not go down if the 
AM crashes, how can work/tasks be recovered without needing to re-run them?

 













 









 










was (Author: hitesh):
Spell check:
  - logErrorIngored, hearbeats, getCurretnDagName

Fix imports:
   - remove “*” e,g, import org.apache.tez.common.*; 

abortTask vs close/cleanup
   - are both close and cleanup being invoked in both TezTaskRunner and Runner2 
even when a task is aborted? ( need to re-check code )

TezTaskRunner2

{code}
* If a kill request is made to the task, it will not communicate this 
information to * the AM - since a task KILL is an external event, and whoever 
invoked it should * be able to track it.
{code}
   - shouldn’t the AM be informed that the task has completed after the abort 
related steps have been done? 


{code}
private boolean isRunningState() {   return !taskComplete.get() && 
!killTaskRequested.get() && !stopContainerRequested.get() &&       
!errorSeen.get(); }
{code}
    - any possibility of simplifying this? Each get is atomic but the combined 
AND expression is not atomic. 
    - any reason why the task runner needs to know that the container stop is 
requested? For a task runner, it should only know about managing a task and not 
the container. TezChild should handle the container stop handling instead?

{code}
// Don't throw an error since the task is already in the process of shutting 
down. LOG.info("returning canCommit=false since task is not in a running 
state"); return false;
{code}
    - why would a task call canCommit while shutting down? Shouldn’t we throw 
an exception anyway as it is not meant to be called during shutdown?

TaskReporter
   - some functions such as shutdown not synchronized?

tez-ext-service-tests/ShuffleHandler
   - Not sure why we need a complex impl of a shuffle handler for a regression 
test? If this is meant for a benchmark tool, maybe it should categorized as 
such and become more maintstream?

tez-ext-service-tests
   - if the ext-service layer is meant to be a full fledged plugin, it would be 
good if this became the defacto guide on how to implement the necessary plugin 
layers to talk to a simple 3rd party service. Current approach seems quite 
complex. 

JoinValidate:
   - how does one actually use this example with configured external execution 
engines? The change is added but no real docs or information on how it can 
actually be run to demonstrate the feature. 

TezTaskCommunicatorImpl
   - Assuming this is the internal communicator used by Tez between the AM and 
YARN containers, why does it needs a user payload and initialization from the 
user payload? 

TaskCommunicator
   - has an initialize() but user payload seems to be parsed in the constructor 
in TezTaskCommunicatorImpl

TaskCommunicatorContextImpl

{code}
public boolean isKnownContainer(ContainerId containerId) {   return 
context.getAllContainers().get(containerId) != null; }
{code}
   - Shouldn’t each plugin manage its own containers? Or at least shouldn’t 
this query be done based on which launcher plugin was being used for the given 
container? Likewise for containerAlive().

APIs in TaskCommunicatorContextImpl probably some minor renaming - say 
something like notifyTaskFailure() maybe?

getCurretnDagName() - dag can be null. Not handled. Likewise for 
getInputVertexNames() and other apis where both dag, vertex ( even task ) could 
be null. Needs more defensive programming against bad plugins.

{code}
public void onStateUpdated(VertexStateUpdate event)
{code}
  - in this scenario, is the TaskCommunicator using the context to tell the AM 
that a vertex has completed?

dagCompleteStart() - not sure what this API is meant to signify. It seems like 
the context is being notified that the DAG has started?

Is there a need for the framework to make updates into the Context object? If 
yes, should the Context implement 2 interfaces? Should the internal objects 
just bind to the internal Impl objects or are they bound to the public plugin 
interfaces to catch compat errors? Binding to Impls directly may mean a smaller 
public API interface.

TaskAttemptListenerImpTezDag

ctor.setAccessible(true); 
   - why is this needed? Why not throw an error? 

General design question:
   - work preserving mode
   - not for this jira but any general thoughts on how work preserving restarts 
will work? For that matter, considering an LLAP service will not go down if the 
AM crashes, how can work/tasks be recovered without needing to re-run them?

 













 









 









> [Umbrella] Allow Tez to co-ordinate execution to external services
> ------------------------------------------------------------------
>
>                 Key: TEZ-2003
>                 URL: https://issues.apache.org/jira/browse/TEZ-2003
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>         Attachments: 2003_20150728.1.txt, Tez With External Services.pdf
>
>
> The Tez engine itself takes care of co-ordinating execution - controlling how 
> data gets routed (different connection patterns), fault tolerance, scheduling 
> of work, etc.
> This is currently tied to TaskSpecs defined within Tez and on containers 
> launched by Tez itself (TezChild).
> The proposal is to allow Tez to work with external services instead of just 
> containers launched by Tez. This involves several more pluggable layers to 
> work with alternate Task Specifications, custom launch and task allocation 
> mechanics, as well as custom scheduling sources.
> A simple example would be a simple a process with the capability to execute 
> multiple Tez TaskSpecs as threads. In such a case, a container launch isn't 
> really need and can be mocked. Sourcing / scheduling containers would need to 
> be pluggable.
> A more advanced example would be LLAP (HIVE-7926; 
> https://issues.apache.org/jira/secure/attachment/12665704/LLAPdesigndocument.pdf).
> This works with custom interfaces - which would need to be supported by Tez, 
> along with a custom event model which would need translation hooks.
> Tez should be able to work with a combination of certain vertices running in 
> external services and others running in regular Tez containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (TEZ-2003) [Umbrella] Allow Tez to co-ordinate execution to external services

Reply via email to