[
https://issues.apache.org/jira/browse/TEZ-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681128#comment-14681128
]
Siddharth Seth commented on TEZ-2003:
-------------------------------------
bq. should not need to have special casing for any plugin
Special casing is in place primarily for LocalContainerExecutor, which requires
a bunch of information at runtime - which isn't needed in the context
otherwise. There's a jira to provide such information via runtime binding in
the payload. For the other cases, it's mainly used to make it simpler to write
tests - where the default executor can be easily overwritten for the tests. The
construction, along with the payload, remains the same - except it's direct
instead of using reflection.
bq. Also creating a ServicePlugin class will help in reducing code duplication
and make maintenance easier instead of having scheduler id, launcherId and
commId everywhere.
The 3 constructs are not used together everywhere. There's multiple events /
other classes which only use a subset of these. A single class won't really
help there.
bq. ContainerSignatureMatcher -> ExecutorSignatureMatcher ?
Tracked in 2708.
bq. ServicePluginLifecyle etc. in tez-runtime-api like
Inputs/Output/InputInitializer etc
shutdown would make more sense for a service.
bq. Why are executedInAm and executeInContainers there
executeInAm and executeInContainers are in Contexts to specify whether a task
runs in a service or in the AM. It's possible to set a DAG level default to run
everything in an external service, and some vertices either in containers or in
the AM.
Similarly for the ServiceDescriptor - decide whether the AM runs containers or
uber-mode during setup.
bq. Rename to ExecutorEndReason ?
Give the abstraction that exists is containers (an executor could be confused
for a service daemon), ContainerEndReason seems fine. This can change when Tez
introduces it's own version of 'Contaienrs' instead of relying on the YARN
abstraction.
bq. Also, how can "An error in the AM" be caused by a container running a task?
Not sure what you mean by this. " An error in the AM caused by user code" -
implies an error which occurred in the AM process as a result of a plugin.
bq. Why does this have schedulerName and taskCommName ?
It's used for the startRequest.
bq. Has the internal one been replaced by this?
No but there's a jira open to consolidate the two.
bq. Rename to ExecutorBusy ?
tracked in TEZ-2707
bq. Why isLocal flag needs to be passed to Scheduler/Launcher/Communicator
routers? Instead of a service plugin for local
There's certain operations which are performed differently for local mode. Also
used to indicate to internal plugins whether they're running in local / uber
mode.
bq. Is is ensured that the integer for a service plugin will turn out to be the
same after AM restart?
Yes
bq. Why is yarn scheduler special cased? Launcher/Communicator dont have the
special casing ?
To always run the YARNScheduler (i.e. register with YARN) if running in
non-local mode. If we were to support alternate frameworks, this could be
removed.
bq. Why use different code path for uber/default. They should just work when
instantiated the same way as a custom plugin.
Primarily for testing. First part of this comment.
bq. Are this and other methods threadsafe wrt callback from multiple plugins?
They should be. I'll scan through them. Would appreciate if you do the same to
identify issues.
bq. Also in heartbeat(), the following code has been lost during merge.
Tracked in TEZ-2707
bq. Why are the contextImpls not directing invoking/handling the plugins
instead of going through the router?
They don't need to. ContextImpls are primarily for communication from the
plugins to the framework. The routers should handle framework to plugins.
bq. Why are the contextImpls not directing invoking/handling the plugins
instead of going through the router?
This avoids some race between dag transitions.
bq. Why has the synchronization been removed. I remember this being a subtle
race condition.
sync on containerInfo is no longer required since there's a new entry inserted
into the structure each time.
bq. The dagCompleteStart/End logic is either broken or unnecessary because the
correct dag seems to be always received from appContext.getCurrentDAG().
This is again for transitions between DAGs. A new dag is received when a dag is
submitted - the context update needs to be factored out. dagComplete is sent to
a plugin - which can take an arbitrary time to process. During this time, any
lookups it does will be from the last dag - instead of a possible new dag,
which could be submitted anytime.
bq. Why not keep a cached copy instead of converting each time?
Fixed in TEZ-2678
bq. There is a scheduledTime on master that this is duplicating.
Will create a nice conflict when i rebase the branch next. Will resolve it then.
bq. What is the use of this?
Can be used for scheduling decisions around aging of a task.
bq. Why is the vertex code concerned with checking local mode/uber etc.? It
should simply take the vertex or default
execution context and use it.
It has to take a decision about running in local mode / non-local mode -
contexts can be null. They're not always setup and sent over the wire.
bq. Create base class that takes care of the repeated schedulerId code in all
scheduler events?
I believe I tried doing this and there were some complications, or it was done
for some other events. IAC, will look. Tracked in TEZ-2707.
bq. Why is the end reason here. Dont see any use anywhere in the patch.
It's not used by the Tez schedulers, but can be used by external schedulers to
blacklist nodes - for instance when an external service is busy.
bq. Have a base class that prevents code duplication for launcherId etc. for
all AMContainerEvents
Same as for the scheduler events.
bq. The pending changes to AMNode (in general node handling per scheduler)
really need to be prioritized. They look quite incomplete because nodes are
closely tied to execution environment and the current state of the code with
some bits handled and some bits not, can be error prone.
This should be in place already. TEZ-2124 is done.
bq. What is a source? Can we call it scheduler or execution environment?
Typically a scheduler. What gets interesting here is when the same source
provides to multiple executors. At that point we'd need to differentiate
between a node going down as a result of the 'real' source reporting it, vs Tez
making an inference based on failed tasks.
bq. Remove commented code
Tracked in 2707
bq. New code misses the ++cData.numUpdates
numUpdate handling no longer done via TaskAttemptListener, so this isn't needed.
bq. Why do we need the new session token related code here?
Separation between TaskAttemptListener and the plugin it creates. A token is
needed.
bq. Create shared method instead of duplicating code?
2707
bq. ExecutionContextTestInfoHolder
2707
bq. Why is this mock return of RUNNING state missing. Allocated containers
checks for running state. Is that handled by the new code in
TestTaskSchedulerHelpers?
That's served and checked via the context.
> [Umbrella] Allow Tez to co-ordinate execution to external services
> ------------------------------------------------------------------
>
> Key: TEZ-2003
> URL: https://issues.apache.org/jira/browse/TEZ-2003
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Attachments: 2003_20150728.1.txt, 2003_20150807.1.txt,
> 2003_20150807.2.txt, Tez With External Services.pdf
>
>
> The Tez engine itself takes care of co-ordinating execution - controlling how
> data gets routed (different connection patterns), fault tolerance, scheduling
> of work, etc.
> This is currently tied to TaskSpecs defined within Tez and on containers
> launched by Tez itself (TezChild).
> The proposal is to allow Tez to work with external services instead of just
> containers launched by Tez. This involves several more pluggable layers to
> work with alternate Task Specifications, custom launch and task allocation
> mechanics, as well as custom scheduling sources.
> A simple example would be a simple a process with the capability to execute
> multiple Tez TaskSpecs as threads. In such a case, a container launch isn't
> really need and can be mocked. Sourcing / scheduling containers would need to
> be pluggable.
> A more advanced example would be LLAP (HIVE-7926;
> https://issues.apache.org/jira/secure/attachment/12665704/LLAPdesigndocument.pdf).
> This works with custom interfaces - which would need to be supported by Tez,
> along with a custom event model which would need translation hooks.
> Tez should be able to work with a combination of certain vertices running in
> external services and others running in regular Tez containers.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)