Martin Kleppmann created SAMZA-300:
--------------------------------------
Summary: Track producers and consumers of streams
Key: SAMZA-300
URL: https://issues.apache.org/jira/browse/SAMZA-300
Project: Samza
Issue Type: New Feature
Reporter: Martin Kleppmann
Each Samza job runs independently, which has a lot of advantages. However,
there are situations in which it would be valuable to have a global overview of
the data flows between jobs. For example:
- It's important for correctness that only one job ever publishes to a given
checkpoint or changelog stream — if several jobs publish to the same stream,
the result is nonsensical. However, we currently have no way of enforcing that.
It would be good if a job could take a "write lock" on a stream, and thus
prevent others from writing to it.
- It would be awesome to have a dashboard/visualization that graphically shows
the job graph and visually highlights the health of each job (e.g. whether a
job has fallen behind).
- The job graph would also be generally useful for tracking data provenance
(finding consumers who would be affected by a schema change, finding the team
that is responsible for producing a particular stream, etc.).
- Such an overview could also include additional metadata about streams, e.g. owner,
serialization format, schema, documentation of semantics of the data, etc.
(HCatalog for streams?)
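
To make the write-lock and provenance ideas above more concrete, here is a
minimal in-memory sketch. Everything in it is hypothetical: StreamRegistry,
acquire_write_lock, register_consumer and downstream_jobs are names invented
for illustration and are not part of Samza or Kafka. A real implementation
would need shared, durable storage and some way to release locks held by dead
jobs, neither of which is attempted here.

```python
from collections import defaultdict


class StreamAlreadyLockedError(Exception):
    """Raised when a second job tries to write to a stream that has an owner."""


class StreamRegistry:
    """Hypothetical registry of which jobs produce and consume which streams."""

    def __init__(self):
        self._writer = {}                   # stream name -> job holding the write lock
        self._consumers = defaultdict(set)  # stream name -> jobs reading the stream

    def acquire_write_lock(self, stream, job):
        # The first job to ask becomes the owner; the owner may re-acquire freely.
        owner = self._writer.setdefault(stream, job)
        if owner != job:
            raise StreamAlreadyLockedError(
                f"{stream!r} is already written by {owner!r}")

    def register_consumer(self, stream, job):
        self._consumers[stream].add(job)

    def producer_of(self, stream):
        # Returns None if no job has taken the write lock yet.
        return self._writer.get(stream)

    def consumers_of(self, stream):
        return sorted(self._consumers.get(stream, ()))

    def downstream_jobs(self, stream):
        """Jobs directly or transitively affected by a change to `stream`."""
        # Invert the writer map so we can follow each job to its output streams.
        outputs = defaultdict(list)
        for out_stream, job in self._writer.items():
            outputs[job].append(out_stream)

        affected, seen, pending = set(), set(), [stream]
        while pending:
            current = pending.pop()
            if current in seen:
                continue
            seen.add(current)
            for job in self._consumers.get(current, ()):
                if job not in affected:
                    affected.add(job)
                    pending.extend(outputs[job])
        return sorted(affected)
```

With such a registry, a job would call acquire_write_lock for its checkpoint
and changelog streams at startup and refuse to run if the call fails, while
downstream_jobs("some-stream") would answer the "who is affected by a schema
change" question by walking the job graph.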
One possibility would be for Kafka to add some of this functionality, although
it may also make sense to implement it in Samza (that way it would be available
for non-Kafka systems as well, and could use knowledge about the job that Samza
has but Kafka lacks).
This is just a vague description to start a discussion. Please comment with
your ideas on how to best implement this.