[ 
https://issues.apache.org/jira/browse/SAMZA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936773#comment-15936773
 ] 

Jake Maes commented on SAMZA-1120:
----------------------------------

Summarizing a couple of LinkedIn meetings for transparency with the rest of the 
community:

We will add a new Application scope, but it will not conform to either of the 
options in the description. Instead, the hierarchy from broadest to narrowest 
will be: Application -> Job -> Processor -> Task
*Application* - the overall logical expression that will be treated as one 
deployable unit. Note that in previous releases, the Job was the deployable 
unit and that legacy mode will still be supported. However, all new 
applications, particularly those that use the Fluent API, will deploy 
Applications. An Application with only one Job is essentially the same as the 
legacy mode.
*Job* - One logical stage of an application. The Jobs of an Application are 
structured in a DAG, connected by streams, can be deployed on separate hosts, 
can be scaled independently of each other, and are only coordinated in terms of 
start and stop order.
*Processor* - A physical unit of parallelism within an Application. This is 
analogous to "Containers" in previous releases, but includes a JobCoordinator 
in each one. 
*Task* - The logical unit of parallelism. Tasks are mostly hidden and handled 
automatically with the Fluent API.
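
To make the nesting concrete, here is a minimal sketch of the four scopes as plain data classes. All class, field, and method names are hypothetical illustrations and are not Samza APIs; the sketch only models containment, not the DAG of streams between Jobs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """Logical unit of parallelism (hypothetical model, not a Samza class)."""
    name: str


@dataclass
class Processor:
    """Physical unit of parallelism; analogous to a legacy Container."""
    processor_id: str
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Job:
    """One logical stage of an Application."""
    name: str
    processors: List[Processor] = field(default_factory=list)


@dataclass
class Application:
    """The deployable unit, made up of one or more Jobs."""
    name: str
    jobs: List[Job] = field(default_factory=list)

    def is_legacy_equivalent(self) -> bool:
        # An Application with exactly one Job behaves like the legacy mode.
        return len(self.jobs) == 1
```

Under this model, the 0.13.0 case described below (Applications with one Job) is simply an Application for which is_legacy_equivalent() returns True.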

The only change for the 0.13.0 release will be the addition of the Application 
scope. Since this first version will only create Applications with 1 Job, the 
job.* configurations will serve the same purpose as they always have. 

Beyond the 0.13.0 release, we will pursue the above taxonomy as part of a much 
larger config overhaul to address a number of issues. Some are pre-existing and 
others introduced by the multi-stage feature. Here are a few:
1. Task and Job scopes are basically synonymous in the config. There is no 
clear guidance on whether a config belongs at the task scope or the job scope. 
We need to distinguish them or remove one of them.
2. To make configs tractable with many jobs, many streams, many X, we need a 
general pattern for defining defaults on those things. For example, one 
proposal for streams was to introduce a "streams.DEFAULT.*" pattern which 
defines the default values for all streams, which can be overridden on a 
per-stream basis. Whatever the pattern, it needs to be formalized, documented, 
and enforced. 
3. Closely related to 2, we need a general pattern for setting/overriding 
properties on a per-entity basis (e.g. per job).
4. We need a transparency layer to expose all generated entities (streams, 
jobs, ...) and enable users to configure/tune them.
5. We need clear documentation on all of the above, both to enable devs to make 
informed changes to config, and for users to easily configure their 
applications.
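
As a rough illustration of items 2 and 3, the lookup below sketches how a "streams.DEFAULT.*" fallback with per-stream overrides might resolve. Only the "streams.DEFAULT." prefix comes from the proposal; the function name, the stream ids, and the "replication.factor" key are assumptions made up for this example, and none of this is an existing Samza API.

```python
from typing import Mapping, Optional


def resolve_stream_config(config: Mapping[str, str],
                          stream_id: str,
                          key: str) -> Optional[str]:
    """Return the per-stream value if set, else the DEFAULT value, else None.

    Hypothetical helper: the key layout streams.<stream-id>.<key> is assumed
    for illustration only.
    """
    specific = f"streams.{stream_id}.{key}"
    default = f"streams.DEFAULT.{key}"
    if specific in config:
        # A per-stream setting overrides the shared default.
        return config[specific]
    return config.get(default)


# Example config: one shared default and one per-stream override.
config = {
    "streams.DEFAULT.replication.factor": "2",
    "streams.page-views.replication.factor": "3",
}

page_views = resolve_stream_config(config, "page-views", "replication.factor")
clicks = resolve_stream_config(config, "clicks", "replication.factor")
```

Here "page-views" picks up its own override while "clicks" falls back to the default. The same two-step lookup could in principle be reused for jobs or any other entity, which is the kind of general pattern item 3 asks for.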

[~nickpan47] Let me know if I missed anything or if any of the above is 
inaccurate.


> Config scope changes for multi-stage
> ------------------------------------
>
>                 Key: SAMZA-1120
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1120
>             Project: Samza
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Jake Maes
>            Assignee: Jake Maes
>
> With the multi-stage feature (SAMZA-1041), Samza will have the ability to run 
> a collection of processors as a unit. To configure those processors with a 
> single config, we need a way to independently configure each processor. 
> The current best idea is to introduce a new scope in the configs. Here are 2 
> options that differ only in naming. 
> 1. Application->Job->Task scopes for configs. Application configs apply to 
> the entire multistage application. A job config corresponds to a particular 
> processor or job in the application and a task config applies to a particular 
> task. 
> 2. Job->Processor->Task scopes for configs. In this model, a Job is the 
> deployable unit and corresponds to the full multistage application. A 
> processor is an independent stage in the application.
> The advantage of #1 is most developers seem to prefer the term "Application" 
> over "Job".
> The advantage of #2 is minimal renaming in the code and configs, which will 
> likely make migration easier. 
> In both cases, we could define an inheritance structure such that any config not 
> defined at the processor scope can be inherited from the application scope. 
> This should reduce the verbosity of the configs.
> By the way, there is also an issue with how the app.runner config interacts 
> with the job.factory.class config. Do we allow configuring job.factory.class 
> in different runner modes? If so, what are the options?



