[jira] [Updated] (SAMZA-1041) Multi-stage pipeline feature for Samza

Jake Maes (JIRA) Fri, 21 Oct 2016 10:41:52 -0700

     [ 
https://issues.apache.org/jira/browse/SAMZA-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jake Maes updated SAMZA-1041:
-----------------------------
    Description: 
Samza provides a powerful framework for users to implement and deploy stream 
processors. One of the core concepts in Samza is a processor, which is deployed 
individually as a job. While a single job is sufficient to perform some basic 
stream processing, we have seen users apply Samza to more complex problems 
involving multi-stage jobs or pipelines. These range from jobs that require a 
separate repartitioner to co-partition streams for a join, to more complicated 
pipelines in which the stages are purposely decoupled like microservices. 
Historically, users would create a separate job for each stage of the pipeline 
and deploy them manually. A critical part of this manual deployment is creating 
and modifying the intermediate streams to have the appropriate configuration, 
in particular; partition count. This deployment model has proven to be tedious 
and error-prone because:

1. Stream creation is a manual process. If the streams are not pre-created with 
the appropriate configurations, it can lead to unexpected behavior in the 
pipeline. For example failure to join because keys are not being routed to a 
common Task.
2. Job deployment is a manual process. Each stage needs to be deployed 
separately, even though they are often deployed in the same cadence.
3. Configuration is associated with a processor, which makes it more difficult 
to reuse the processor. Configurations like task inputs and container count can 
vary for the same processor depending on the context (pipeline) in which it is 
executed. 
4. There is no early validation to detect a misconfigured pipeline. Instead 
users tend to notice that something is wrong long after the initial deployment 
by looking at metrics. 
5. There are common preconditions for processors (e.g. co-partitioned, or 
deduplicated inputs) that could be handled automatically in a system that has 
the “whole picture” of streams and processors.

Our goal is to alleviate these issues by introducing a yet-to-be-properly-named 
feature which we will call “pipelines” for now. The pipeline feature will allow 
users to easily compose a collection of processors and streams into a directed 
acyclic graph (DAG) and manage them as one unit. A key part of this feature is 
automatic runtime creation of intermediate and output streams. It also enables 
richer validation, simplified deployment, and a foundation for many performance 
and ease-of-use features. For example, repartitioners could be automatically 
injected where needed and processors could be colocated on the same container 
for performance. 

Note that this feature is not the same as SAMZA-914, mostly in terms of scope 
and simplicity. SAMZA-914 focuses on composing operators into a logical flow 
that is executed as one processor. That entire flow is scaled out uniformly by 
adding containers. By contrast, this pipeline feature provides isolation and 
independent scaling of the processors. 

Many of the details have yet to be worked out, but a design doc will be posted 
here soon. 

  was:
Samza provides a powerful framework for users to implement and deploy stream 
processors. One of the core concepts in Samza is a processor, which is deployed 
individually as a job. While a single job is sufficient to perform some basic 
stream processing, we have seen users apply Samza to more complex problems 
involving multi-stage jobs or pipelines. These range from jobs that require a 
separate repartitioner to co-partition streams for a join, to more complicated 
pipelines in which the stages are purposely decoupled like microservices. 
Historically, users would create a separate job for each stage of the pipeline 
and deploy them manually. A critical part of this manual deployment is creating 
and modifying the intermediate streams to have the appropriate configuration, 
in particular; partition count. This deployment model has proven to be tedious 
and error-prone because:

1. Stream creation is a manual process. If the streams are not pre-created with 
the appropriate configurations, it can lead to unexpected behavior in the 
pipeline. For example failure to join because keys are not being routed to a 
common Task.
2. Job deployment is a manual process. Each stage needs to be deployed 
separately, even though they are often deployed in the same cadence.
3. Configuration is associated with a processor, which makes it more difficult 
to reuse the processor. Configurations like task inputs and container count can 
vary for the same processor depending on the context (pipeline) in which it is 
executed. 
4. There is no early validation to detect a misconfigured pipeline. Instead 
users tend to notice that something is wrong long after the initial deployment 
by looking at metrics. 
5. There are common preconditions for processors (e.g. co-partitioned, or 
deduplicated inputs) that could be handled automatically in a system that has 
the “whole picture” of streams and processors.

Our goal is to alleviate these issues by introducing a yet-to-be-properly-named 
feature which we will call “pipelines” for now. The pipeline feature will allow 
users to easily compose a collection of processors and streams into a directed 
acyclic graph (DAG) and manage them as one unit. A key part of this feature is 
automatic runtime creation of intermediate and output streams. It also enables 
richer validation, simplified deployment, and a foundation for many performance 
and ease-of-use features. For example, repartitioners could be automatically 
injected where needed and processors could be colocated on the same container 
for performance. 

Many of the details have yet to be worked out, but a design doc will be posted 
here soon. 


> Multi-stage pipeline feature for Samza
> --------------------------------------
>
>                 Key: SAMZA-1041
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1041
>             Project: Samza
>          Issue Type: New Feature
>    Affects Versions: 0.12.0
>            Reporter: Jake Maes
>            Assignee: Jake Maes
>
> Samza provides a powerful framework for users to implement and deploy stream 
> processors. One of the core concepts in Samza is a processor, which is 
> deployed individually as a job. While a single job is sufficient to perform 
> some basic stream processing, we have seen users apply Samza to more complex 
> problems involving multi-stage jobs or pipelines. These range from jobs that 
> require a separate repartitioner to co-partition streams for a join, to more 
> complicated pipelines in which the stages are purposely decoupled like 
> microservices. Historically, users would create a separate job for each stage 
> of the pipeline and deploy them manually. A critical part of this manual 
> deployment is creating and modifying the intermediate streams to have the 
> appropriate configuration, in particular; partition count. This deployment 
> model has proven to be tedious and error-prone because:
> 1. Stream creation is a manual process. If the streams are not pre-created 
> with the appropriate configurations, it can lead to unexpected behavior in 
> the pipeline. For example failure to join because keys are not being routed 
> to a common Task.
> 2. Job deployment is a manual process. Each stage needs to be deployed 
> separately, even though they are often deployed in the same cadence.
> 3. Configuration is associated with a processor, which makes it more 
> difficult to reuse the processor. Configurations like task inputs and 
> container count can vary for the same processor depending on the context 
> (pipeline) in which it is executed. 
> 4. There is no early validation to detect a misconfigured pipeline. Instead 
> users tend to notice that something is wrong long after the initial 
> deployment by looking at metrics. 
> 5. There are common preconditions for processors (e.g. co-partitioned, or 
> deduplicated inputs) that could be handled automatically in a system that has 
> the “whole picture” of streams and processors.
> Our goal is to alleviate these issues by introducing a 
> yet-to-be-properly-named feature which we will call “pipelines” for now. The 
> pipeline feature will allow users to easily compose a collection of 
> processors and streams into a directed acyclic graph (DAG) and manage them as 
> one unit. A key part of this feature is automatic runtime creation of 
> intermediate and output streams. It also enables richer validation, 
> simplified deployment, and a foundation for many performance and ease-of-use 
> features. For example, repartitioners could be automatically injected where 
> needed and processors could be colocated on the same container for 
> performance. 
> Note that this feature is not the same as SAMZA-914, mostly in terms of scope 
> and simplicity. SAMZA-914 focuses on composing operators into a logical flow 
> that is executed as one processor. That entire flow is scaled out uniformly 
> by adding containers. By contrast, this pipeline feature provides isolation 
> and independent scaling of the processors. 
> Many of the details have yet to be worked out, but a design doc will be 
> posted here soon. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (SAMZA-1041) Multi-stage pipeline feature for Samza

Reply via email to