[
https://issues.apache.org/jira/browse/SAMZA-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jake Maes updated SAMZA-1041:
-----------------------------
Summary: Multi-stage feature for Samza (was: Multi-stage pipeline feature
for Samza)
> Multi-stage feature for Samza
> -----------------------------
>
> Key: SAMZA-1041
> URL: https://issues.apache.org/jira/browse/SAMZA-1041
> Project: Samza
> Issue Type: New Feature
> Affects Versions: 0.12.0
> Reporter: Jake Maes
> Assignee: Jake Maes
>
> Samza provides a powerful framework for users to implement and deploy stream
> processors. One of the core concepts in Samza is a processor, which is
> deployed individually as a job. While a single job is sufficient to perform
> some basic stream processing, we have seen users apply Samza to more complex
> problems involving multi-stage jobs or pipelines. These range from jobs that
> require a separate repartitioner to co-partition streams for a join, to more
> complicated pipelines in which the stages are purposely decoupled like
> microservices. Historically, users would create a separate job for each stage
> of the pipeline and deploy them manually. A critical part of this manual
> deployment is creating and modifying the intermediate streams to have the
> appropriate configuration, in particular; partition count. This deployment
> model has proven to be tedious and error-prone because:
> 1. Stream creation is a manual process. If the streams are not pre-created
> with the appropriate configurations, it can lead to unexpected behavior in
> the pipeline. For example failure to join because keys are not being routed
> to a common Task.
> 2. Job deployment is a manual process. Each stage needs to be deployed
> separately, even though they are often deployed in the same cadence.
> 3. Configuration is associated with a processor, which makes it more
> difficult to reuse the processor. Configurations like task inputs and
> container count can vary for the same processor depending on the context
> (pipeline) in which it is executed.
> 4. There is no early validation to detect a misconfigured pipeline. Instead
> users tend to notice that something is wrong long after the initial
> deployment by looking at metrics.
> 5. There are common preconditions for processors (e.g. co-partitioned, or
> deduplicated inputs) that could be handled automatically in a system that has
> the “whole picture” of streams and processors.
> Our goal is to alleviate these issues by introducing a
> yet-to-be-properly-named feature which we will call “pipelines” for now. The
> pipeline feature will allow users to easily compose a collection of
> processors and streams into a directed acyclic graph (DAG) and manage them as
> one unit. A key part of this feature is automatic runtime creation of
> intermediate and output streams. It also enables richer validation,
> simplified deployment, and a foundation for many performance and ease-of-use
> features. For example, repartitioners could be automatically injected where
> needed and processors could be colocated on the same container for
> performance.
> Note that this feature is not the same as SAMZA-914, mostly in terms of scope
> and simplicity. SAMZA-914 focuses on composing operators into a logical flow
> that is executed as one processor. That entire flow is scaled out uniformly
> by adding containers. By contrast, this pipeline feature provides isolation
> and independent scaling of the processors.
> Many of the details have yet to be worked out, but a design doc will be
> posted here soon.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)