[jira] [Commented] (TEZ-2003) [Umbrella] Allow Tez to co-ordinate execution to external services

Siddharth Seth (JIRA) Mon, 10 Aug 2015 19:36:06 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681128#comment-14681128
 ]


Siddharth Seth commented on TEZ-2003:
-------------------------------------

bq. should not need to have special casing for any plugin
Special casing is in place primarily for LocalContainerExecutor, which requires 
a bunch of information at runtime - which isn't needed in the context 
otherwise. There's a jira to provide such information via runtime binding in 
the payload. For the other cases, it's mainly used to make it simpler to write 
tests - where the default executor can be easily overwritten for the tests. The 
construction, along with the payload, remains the same - except it's direct 
instead of using reflection.

bq. Also creating a ServicePlugin class will help in reducing code duplication 
and make maintenance easier instead of having scheduler id, launcherId and 
commId everywhere.
The 3 constructs are not used together everywhere. There are multiple events / 
other classes which only use a subset of these. A single class won't really 
help there.

bq. ContainerSignatureMatcher -> ExecutorSignatureMatcher ?
Tracked in 2708.

bq. ServicePluginLifecyle etc. in tez-runtime-api like 
Inputs/Output/InputInitializer etc
shutdown would make more sense for a service.

bq. Why are executedInAm and executeInContainers there
executeInAm and executeInContainers are in Contexts to specify whether a task 
runs in a service or in the AM. It's possible to set a DAG level default to run 
everything in an external service, and some vertices either in containers or in 
the AM.
Similarly for the ServiceDescriptor - decide whether the AM runs containers or 
uber-mode during setup.
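
As a rough illustration of the DAG-level default with per-vertex overrides, here is a minimal sketch. All class and method names below (ExecContextSketch, ExecContextResolver, ExecutionTarget) are hypothetical stand-ins, not the classes in the patch, and the fallback for the case where neither context is set is an assumption.

```java
// Hypothetical, simplified model for illustration; NOT the actual Tez classes.
enum ExecutionTarget { CONTAINERS, AM, EXTERNAL_SERVICE }

final class ExecContextSketch {
    final ExecutionTarget target;
    final String serviceName; // only meaningful for EXTERNAL_SERVICE

    private ExecContextSketch(ExecutionTarget target, String serviceName) {
        this.target = target;
        this.serviceName = serviceName;
    }

    static ExecContextSketch executeInAm() {
        return new ExecContextSketch(ExecutionTarget.AM, null);
    }

    static ExecContextSketch executeInContainers() {
        return new ExecContextSketch(ExecutionTarget.CONTAINERS, null);
    }

    static ExecContextSketch executeInService(String serviceName) {
        return new ExecContextSketch(ExecutionTarget.EXTERNAL_SERVICE, serviceName);
    }
}

final class ExecContextResolver {
    // A vertex-level context wins over the DAG-level default. Either may be null.
    // Assumption: with neither set, local mode runs in the AM, otherwise in containers.
    static ExecContextSketch resolve(ExecContextSketch vertexContext,
                                     ExecContextSketch dagDefault,
                                     boolean isLocalMode) {
        if (vertexContext != null) {
            return vertexContext;
        }
        if (dagDefault != null) {
            return dagDefault;
        }
        return isLocalMode
            ? ExecContextSketch.executeInAm()
            : ExecContextSketch.executeInContainers();
    }
}
```

With a DAG default of executeInService("llap"), a vertex without its own context resolves to the external service, while a vertex that sets executeInContainers() keeps running in containers.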

bq. Rename to ExecutorEndReason ?
Given that the abstraction that exists is containers (an executor could be 
confused for a service daemon), ContainerEndReason seems fine. This can change 
when Tez introduces its own version of 'Containers' instead of relying on the 
YARN abstraction.

bq. Also, how can "An error in the AM" be caused by a container running a task?
Not sure what you mean by this. " An error in the AM caused by user code" - 
implies an error which occurred in the AM process as a result of a plugin.

bq. Why does this have schedulerName and taskCommName ?
It's used for the startRequest.

bq. Has the internal one been replaced by this?
No, but there's a jira open to consolidate the two.

bq. Rename to ExecutorBusy ?
Tracked in TEZ-2707

bq. Why isLocal flag needs to be passed to Scheduler/Launcher/Communicator 
routers? Instead of a service plugin for local
There are certain operations which are performed differently for local mode. 
It's also used to indicate to internal plugins whether they're running in 
local / uber mode.

bq. Is it ensured that the integer for a service plugin will turn out to be the 
same after AM restart?
Yes

bq. Why is yarn scheduler special cased? Launcher/Communicator dont have the 
special casing ?
To always run the YARNScheduler (i.e. register with YARN) if running in 
non-local mode. If we were to support alternate frameworks, this could be 
removed.

bq. Why use different code path for uber/default. They should just work when 
instantiated the same way as a custom plugin.
Primarily for testing. See the first part of this comment.

bq. Are this and other methods threadsafe wrt callback from multiple plugins?
They should be. I'll scan through them. Would appreciate if you do the same to 
identify issues.

bq. Also in heartbeat(), the following code has been lost during merge.
Tracked in TEZ-2707

bq. Why are the contextImpls not directly invoking/handling the plugins 
instead of going through the router?
They don't need to. ContextImpls are primarily for communication from the 
plugins to the framework. The routers should handle framework to plugins.

bq. Why are the contextImpls not directly invoking/handling the plugins 
instead of going through the router?
This avoids some race between dag transitions.

bq. Why has the synchronization been removed. I remember this being a subtle 
race condition.
sync on containerInfo is no longer required since there's a new entry inserted 
into the structure each time.
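
A minimal sketch of why inserting a fresh entry on each update removes the need to lock on the entry itself. It assumes a concurrent map holding immutable values; ContainerInfoSketch and ContainerRegistrySketch are hypothetical names, and the real structure in the patch may differ.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical immutable record of what a container is running; never mutated after creation.
final class ContainerInfoSketch {
    final String currentAttemptId; // null when the container is idle
    final long updatedAtMillis;

    ContainerInfoSketch(String currentAttemptId, long updatedAtMillis) {
        this.currentAttemptId = currentAttemptId;
        this.updatedAtMillis = updatedAtMillis;
    }
}

final class ContainerRegistrySketch {
    // Assumption: a concurrent map, so replacing an entry is atomic and safely published.
    private final ConcurrentMap<String, ContainerInfoSketch> infos = new ConcurrentHashMap<>();

    // Each update inserts a NEW entry instead of modifying the existing one in place.
    void assign(String containerId, String attemptId, long nowMillis) {
        infos.put(containerId, new ContainerInfoSketch(attemptId, nowMillis));
    }

    void release(String containerId, long nowMillis) {
        infos.put(containerId, new ContainerInfoSketch(null, nowMillis));
    }

    // Readers get a consistent snapshot; a later update cannot change it underneath them.
    ContainerInfoSketch get(String containerId) {
        return infos.get(containerId);
    }
}
```

A reader holding an older ContainerInfoSketch keeps seeing the state it fetched, so there is nothing on the entry left to synchronize on.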

bq. The dagCompleteStart/End logic is either broken or unnecessary because the 
correct dag seems to be always received from appContext.getCurrentDAG().
This is again for transitions between DAGs. A new dag is received when a dag is 
submitted - the context update needs to be factored out. dagComplete is sent to 
a plugin - which can take an arbitrary time to process. During this time, any 
lookups it does will be from the last dag - instead of a possible new dag, 
which could be submitted anytime.
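
The idea can be sketched roughly as follows. DagScopedContextSketch is a hypothetical class, DAGs are represented by plain name strings, and the supplier stands in for appContext.getCurrentDAG(); this is an illustration of the intent, not the code in the patch.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch: pins the completing DAG while a plugin processes dagComplete.
final class DagScopedContextSketch {
    // Stand-in for appContext.getCurrentDAG(); may start returning a new DAG at any time.
    private final Supplier<String> currentDagSupplier;
    private final AtomicReference<String> completingDag = new AtomicReference<>();

    DagScopedContextSketch(Supplier<String> currentDagSupplier) {
        this.currentDagSupplier = currentDagSupplier;
    }

    // Called before dagComplete is sent to the plugin.
    void dagCompleteStart(String dagName) {
        completingDag.set(dagName);
    }

    // Called once the plugin has finished processing dagComplete.
    void dagCompleteEnd() {
        completingDag.set(null);
    }

    // Lookups made while dagComplete is in progress are served from the pinned DAG.
    String getDagForLookups() {
        String pinned = completingDag.get();
        return pinned != null ? pinned : currentDagSupplier.get();
    }
}
```

If a new DAG is submitted between dagCompleteStart and dagCompleteEnd, lookups from the plugin still resolve against the DAG that is completing, and switch to the new DAG only afterwards.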

bq. Why not keep a cached copy instead of converting each time?
Fixed in TEZ-2678

bq. There is a scheduledTime on master that this is duplicating.
Will create a nice conflict when I rebase the branch next. Will resolve it then.

bq. What is the use of this?
Can be used for scheduling decisions around aging of a task.

bq. Why is the vertex code concerned with checking local mode/uber etc.? It 
should simply take the vertex or default
execution context and use it.
It has to take a decision about running in local mode / non-local mode - 
contexts can be null. They're not always setup and sent over the wire.

bq. Create base class that takes care of the repeated schedulerId code in all 
scheduler events?
I believe I tried doing this and there were some complications, or it was done 
for some other events. In any case, will look. Tracked in TEZ-2707.

bq. Why is the end reason here. Dont see any use anywhere in the patch.
It's not used by the Tez schedulers, but can be used by external schedulers to 
blacklist nodes - for instance when an external service is busy.
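
For illustration, an external scheduler could use the end reason along these lines. EndReasonSketch and BackoffSchedulerSketch are hypothetical names, the enum values are assumptions, and the fixed backoff window is just one possible policy.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical end reasons for illustration; not the enum defined in the patch.
enum EndReasonSketch { EXECUTOR_BUSY, NODE_FAILED, OTHER }

// Hypothetical external scheduler that temporarily avoids nodes whose service reported busy.
final class BackoffSchedulerSketch {
    private final Map<String, Long> blockedUntilMillis = new HashMap<>();
    private final long backoffMillis;

    BackoffSchedulerSketch(long backoffMillis) {
        this.backoffMillis = backoffMillis;
    }

    // Invoked when a task attempt ends; only a busy service triggers the temporary blacklist.
    synchronized void taskEnded(String node, EndReasonSketch reason, long nowMillis) {
        if (reason == EndReasonSketch.EXECUTOR_BUSY) {
            blockedUntilMillis.put(node, nowMillis + backoffMillis);
        }
    }

    // A node is usable if it was never blocked or its backoff window has elapsed.
    synchronized boolean isUsable(String node, long nowMillis) {
        Long until = blockedUntilMillis.get(node);
        return until == null || nowMillis >= until;
    }
}
```

The built-in schedulers can ignore the reason entirely; only a scheduler that cares about service load needs to look at it.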

bq. Have a base class that prevents code duplication for launcherId etc. for 
all AMContainerEvents
Same as for the scheduler events.

bq. The pending changes to AMNode (in general node handling per scheduler) 
really need to be prioritized. They look quite incomplete because nodes are 
closely tied to execution environment and the current state of the code with 
some bits handled and some bits not, can be error prone.
This should be in place already. TEZ-2124 is done.

bq. What is a source? Can we call it scheduler or execution environment?
Typically a scheduler. What gets interesting here is when the same source 
provides to multiple executors. At that point we'd need to differentiate 
between a node going down as a result of the 'real' source reporting it, vs Tez 
making an inference based on failed tasks.

bq. Remove commented code
Tracked in 2707

bq. New code misses the ++cData.numUpdates
numUpdates handling is no longer done via TaskAttemptListener, so this isn't needed.

bq. Why do we need the new session token related code here?
Separation between TaskAttemptListener and the plugin it creates. A token is 
needed.

bq. Create shared method instead of duplicating code?
2707

bq. ExecutionContextTestInfoHolder
2707

bq. Why is this mock return of RUNNING state missing. Allocated containers 
checks for running state. Is that handled by the new code in 
TestTaskSchedulerHelpers?
That's served and checked via the context.

> [Umbrella] Allow Tez to co-ordinate execution to external services
> ------------------------------------------------------------------
>
>                 Key: TEZ-2003
>                 URL: https://issues.apache.org/jira/browse/TEZ-2003
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>         Attachments: 2003_20150728.1.txt, 2003_20150807.1.txt, 
> 2003_20150807.2.txt, Tez With External Services.pdf
>
>
> The Tez engine itself takes care of co-ordinating execution - controlling how 
> data gets routed (different connection patterns), fault tolerance, scheduling 
> of work, etc.
> This is currently tied to TaskSpecs defined within Tez and on containers 
> launched by Tez itself (TezChild).
> The proposal is to allow Tez to work with external services instead of just 
> containers launched by Tez. This involves several more pluggable layers to 
> work with alternate Task Specifications, custom launch and task allocation 
> mechanics, as well as custom scheduling sources.
> A simple example would be a process with the capability to execute 
> multiple Tez TaskSpecs as threads. In such a case, a container launch isn't 
> really needed and can be mocked. Sourcing / scheduling containers would need to 
> be pluggable.
> A more advanced example would be LLAP (HIVE-7926; 
> https://issues.apache.org/jira/secure/attachment/12665704/LLAPdesigndocument.pdf).
> This works with custom interfaces - which would need to be supported by Tez, 
> along with a custom event model which would need translation hooks.
> Tez should be able to work with a combination of certain vertices running in 
> external services and others running in regular Tez containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
