[
https://issues.apache.org/jira/browse/SAMZA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936773#comment-15936773
]
Jake Maes commented on SAMZA-1120:
----------------------------------
Summarizing a couple of LinkedIn meetings for transparency with the rest of the
community:
We will add a new Application scope, but it will not conform to either of the
options in the description. Instead, the hierarchy from broadest to narrowest
will be: Application -> Job -> Processor -> Task
*Application* - The overall logical expression that will be treated as one
deployable unit. Note that in previous releases, the Job was the deployable
unit, and that legacy mode will still be supported. However, all new
applications, particularly those that use the Fluent API, will deploy
Applications. An Application with only one Job is essentially the same as the
legacy mode.
*Job* - One logical stage of an application. The Jobs of an Application are
structured in a DAG, connected by streams, can be deployed on separate hosts,
can be scaled independently of each other, and are only coordinated in terms of
start and stop order.
*Processor* - A physical unit of parallelism within an Application. This is
analogous to "Containers" in previous releases, but includes a JobCoordinator
in each one.
*Task* - The logical unit of parallelism. Tasks are mostly hidden and handled
automatically with the Fluent API.
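To make the taxonomy above concrete, here is a minimal sketch of the four
scopes as nested data types. It is only an illustration and not Samza code: the
class and field names are hypothetical, and nesting Processors under a Job is an
assumption read off the "Application -> Job -> Processor -> Task" ordering.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """Logical unit of parallelism (narrowest scope)."""
    name: str


@dataclass
class Processor:
    """Physical unit of parallelism; analogous to a legacy Container."""
    processor_id: str
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Job:
    """One logical stage of an Application."""
    name: str
    processors: List[Processor] = field(default_factory=list)


@dataclass
class Application:
    """The deployable unit (broadest scope)."""
    name: str
    jobs: List[Job] = field(default_factory=list)

    def is_legacy_equivalent(self) -> bool:
        # An Application with exactly one Job behaves like the legacy mode.
        return len(self.jobs) == 1


# Hypothetical single-job application, as planned for the first version.
app = Application(
    name="page-view-counter",
    jobs=[Job(name="count", processors=[Processor("0", [Task("Partition 0")])])],
)
```

Here app.is_legacy_equivalent() is true because the application has a single
Job, which matches the "essentially the same as the legacy mode" case.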
The only change for the 0.13.0 release will be the addition of the Application
scope. Since this first version will only create Applications with 1 Job, the
job.* configurations will serve the same purpose as they always have.
Beyond the 0.13.0 release, we will pursue the above taxonomy as part of a much
larger config overhaul to address a number of issues. Some are pre-existing and
others introduced by the multi-stage feature. Here are a few:
1. Task and Job scopes are basically synonymous in the config. There is no
clear guidance on whether a given config belongs to the task scope or the job
scope. We need to distinguish them or remove one of them.
2. To make configs tractable with many jobs, many streams, many X, we need a
general pattern for defining defaults on those things. For example, one
proposal for streams was to introduce a "streams.DEFAULT.*" pattern which
defines the default values for all streams, which can be overridden on a
per-stream basis. Whatever the pattern, it needs to be formalized, documented,
and enforced.
3. Closely related to 2, we need a general pattern for setting/overriding
properties on a per-entity basis (e.g. per job).
4. We need a transparency layer to expose all generated entities (streams,
jobs, ...) and enable users to configure/tune them.
5. We need clear documentation on all of the above, both so that devs can make
informed changes to the config and so that users can easily configure their
applications.
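As a rough illustration of items 2 and 3, the following sketch resolves the
effective properties for one entity by layering per-entity overrides on top of
a DEFAULT block. Only the "streams.DEFAULT.*" prefix comes from the proposal
mentioned above. The function name, the property names, and the idea of reusing
the same DEFAULT convention for other entity types are assumptions made for the
example, not existing Samza behavior.

```python
from typing import Dict, Mapping


def resolve_entity_config(
    config: Mapping[str, str], entity_type: str, entity_id: str
) -> Dict[str, str]:
    """Merge '<entity_type>.DEFAULT.*' with '<entity_type>.<entity_id>.*'."""
    default_prefix = f"{entity_type}.DEFAULT."
    entity_prefix = f"{entity_type}.{entity_id}."
    resolved: Dict[str, str] = {}

    # First pass: collect the defaults shared by every entity of this type.
    for key, value in config.items():
        if key.startswith(default_prefix):
            resolved[key[len(default_prefix):]] = value

    # Second pass: per-entity values override the defaults.
    for key, value in config.items():
        if key.startswith(entity_prefix):
            resolved[key[len(entity_prefix):]] = value

    return resolved


# Hypothetical property names, used only to exercise the function.
example_config = {
    "streams.DEFAULT.replication.factor": "2",
    "streams.DEFAULT.partitions": "8",
    "streams.page-views.partitions": "32",
}

page_views = resolve_entity_config(example_config, "streams", "page-views")
clicks = resolve_entity_config(example_config, "streams", "clicks")
```

With this input, page_views takes its partition count from the per-stream
override and keeps the default replication factor, while clicks, which has no
overrides, gets both defaults unchanged.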
[~nickpan47] Let me know if I missed anything or if any of the above is
inaccurate.
> Config scope changes for multi-stage
> ------------------------------------
>
> Key: SAMZA-1120
> URL: https://issues.apache.org/jira/browse/SAMZA-1120
> Project: Samza
> Issue Type: Bug
> Affects Versions: 0.13.0
> Reporter: Jake Maes
> Assignee: Jake Maes
>
> With the multi-stage feature (SAMZA-1041), Samza will have the ability to run
> a collection of processors as a unit. To configure those processors with a
> single config, we need a way to independently configure each processor.
> The current best idea is to introduce a new scope in the configs. Here are 2
> options that differ only in naming.
> 1. Application->Job->Task scopes for configs. Application configs apply to
> the entire multistage application. A job config corresponds to a particular
> processor or job in the application and a task config applies to a particular
> task.
> 2. Job->Processor->Task scopes for configs. In this model, a Job is the
> deployable unit and corresponds to the full multistage application. A
> processor is an independent stage in the application.
> The advantage of #1 is that most developers seem to prefer the term
> "Application" over "Job".
> The advantage of #2 is minimal renaming in the code and configs, which will
> likely make migration easier.
> In both cases, we could define an inheritance structure such that any config
> not defined at the processor scope can be inherited from the application
> scope. This should reduce the verbosity of the configs.
> By the way, there is also an issue with how the app.runner config interacts
> with the job.factory.class config. Do we allow configuring job.factory.class
> in different runner modes? If so, what are the options?
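The inheritance structure suggested in the quoted description (a config not
defined at the narrower scope falls back to the application scope) could be
sketched as below. The "app." and "jobs.<name>." key prefixes, the function
name, and the property name are all hypothetical and chosen only for
illustration. They are not existing Samza config keys.

```python
from typing import Mapping, Optional


def lookup_with_inheritance(
    config: Mapping[str, str], job: str, key: str
) -> Optional[str]:
    """Return the job-scoped value if present, else the application-scoped one."""
    job_key = f"jobs.{job}.{key}"
    if job_key in config:
        # The narrower scope wins when it defines the key.
        return config[job_key]
    # Otherwise inherit from the application scope (None if undefined there too).
    return config.get(f"app.{key}")


# Hypothetical two-stage application: only "enrich" overrides the memory setting.
scoped_config = {
    "app.container.memory.mb": "1024",
    "jobs.enrich.container.memory.mb": "2048",
}

enrich_memory = lookup_with_inheritance(scoped_config, "enrich", "container.memory.mb")
filter_memory = lookup_with_inheritance(scoped_config, "filter", "container.memory.mb")
```

Only the stage that needs a different value has to spell it out, which is the
reduction in verbosity the description is after.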