Jake Maes created SAMZA-1041:
--------------------------------
Summary: Multi-stage pipeline feature for Samza
Key: SAMZA-1041
URL: https://issues.apache.org/jira/browse/SAMZA-1041
Project: Samza
Issue Type: New Feature
Affects Versions: 0.12.0
Reporter: Jake Maes
Assignee: Jake Maes
Samza provides a powerful framework for users to implement and deploy stream
processors. One of the core concepts in Samza is a processor, which is deployed
individually as a job. While a single job is sufficient to perform some basic
stream processing, we have seen users apply Samza to more complex problems
involving multi-stage jobs or pipelines. These range from jobs that require a
separate repartitioner to co-partition streams for a join, to more complicated
pipelines in which the stages are purposely decoupled like microservices.
Historically, users would create a separate job for each stage of the pipeline
and deploy them manually. A critical part of this manual deployment is creating
and modifying the intermediate streams to have the appropriate configuration,
in particular the partition count. This deployment model has proven to be tedious
and error-prone because:
1. Stream creation is a manual process. If the streams are not pre-created with
the appropriate configurations, it can lead to unexpected behavior in the
pipeline. For example, a join can fail because keys are not routed to a
common Task.
2. Job deployment is a manual process. Each stage needs to be deployed
separately, even though they are often deployed on the same cadence.
3. Configuration is associated with a processor, which makes it more difficult
to reuse the processor. Configurations like task inputs and container count can
vary for the same processor depending on the context (pipeline) in which it is
executed.
4. There is no early validation to detect a misconfigured pipeline. Instead,
users tend to notice that something is wrong long after the initial deployment
by looking at metrics.
5. There are common preconditions for processors (e.g., co-partitioned or
deduplicated inputs) that could be handled automatically in a system that has
the “whole picture” of streams and processors.
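
The routing problem in item 1 can be illustrated with a small sketch. It
assumes a hash-modulo partitioner, which is a common scheme but not necessarily
what a given system producer uses. The names partition_for and misrouted_keys
are hypothetical and are not part of Samza.

```python
import zlib


def partition_for(key: str, partition_count: int) -> int:
    """Assumed partitioner: stable CRC32 hash of the key, modulo partition count."""
    return zlib.crc32(key.encode("utf-8")) % partition_count


def misrouted_keys(keys, left_partitions: int, right_partitions: int):
    """Return the keys that land in different partitions of the two streams.

    A task that joins partition N of both streams never sees both sides
    of these keys, so the join silently drops them.
    """
    return [
        key
        for key in keys
        if partition_for(key, left_partitions) != partition_for(key, right_partitions)
    ]


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(10)]
    # Equal partition counts: every key is co-partitioned.
    print(misrouted_keys(keys, 8, 8))
    # Mismatched counts: some keys are routed to different partition numbers.
    print(misrouted_keys(keys, 4, 8))
```

Under this assumption, two streams are co-partitioned only when they use the
same key, the same hash, and the same partition count, which is why a
mismatched intermediate stream breaks the join.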
Our goal is to alleviate these issues by introducing a yet-to-be-properly-named
feature which we will call “pipelines” for now. The pipeline feature will allow
users to easily compose a collection of processors and streams into a directed
acyclic graph (DAG) and manage them as one unit. A key part of this feature is
automatic runtime creation of intermediate and output streams. It also enables
richer validation, simplified deployment, and a foundation for many performance
and ease-of-use features. For example, repartitioners could be automatically
injected where needed and processors could be colocated on the same container
for performance.
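
As a rough illustration of what managing processors and streams as one DAG
could enable, the sketch below models a pipeline, derives the intermediate
streams that would need to be created, and rejects cyclic graphs before
deployment. Since the design is not yet written, everything here is an
assumption: the Pipeline class, its methods, and the stream names are
hypothetical and do not reflect the eventual Samza API.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter  # Python 3.9+


@dataclass
class Pipeline:
    """Hypothetical model: processors connected by named streams."""

    # processor name -> (input stream names, output stream names)
    processors: dict = field(default_factory=dict)

    def add_processor(self, name, inputs, outputs):
        self.processors[name] = (set(inputs), set(outputs))
        return self

    def intermediate_streams(self):
        """Streams produced by one processor and consumed by another."""
        produced = set().union(*(outs for _, outs in self.processors.values()))
        consumed = set().union(*(ins for ins, _ in self.processors.values()))
        return produced & consumed

    def deployment_order(self):
        """Order processors so producers come before their consumers.

        Raises ValueError if the processors do not form a DAG.
        """
        producers = {}
        for name, (_, outs) in self.processors.items():
            for stream in outs:
                producers.setdefault(stream, set()).add(name)
        # Each processor depends on whichever processors write its inputs.
        deps = {
            name: set().union(*(producers.get(s, set()) for s in ins))
            for name, (ins, _) in self.processors.items()
        }
        try:
            return list(TopologicalSorter(deps).static_order())
        except CycleError as err:
            raise ValueError(f"pipeline is not a DAG: {err.args[1]}") from err


if __name__ == "__main__":
    pipeline = (
        Pipeline()
        .add_processor("repartition", ["page-views"], ["page-views-by-user"])
        .add_processor("join", ["page-views-by-user", "profiles"], ["enriched"])
    )
    print(pipeline.intermediate_streams())
    print(pipeline.deployment_order())
```

With the whole graph available up front, checks such as acyclicity or matching
partition counts on join inputs can run before anything is deployed, rather
than surfacing later in metrics.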
Many of the details have yet to be worked out, but a design doc will be posted
here soon.