[
https://issues.apache.org/jira/browse/SAMZA-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037898#comment-14037898
]
Yan Fang commented on SAMZA-300:
--------------------------------
{quote}
It's important for correctness that only one job ever publishes to a given
checkpoint or changelog stream — if several jobs publish to the same stream,
the result is nonsensical. However, we currently have no way of enforcing that.
It would be good if a job could take a "write lock" on a stream, and thus
prevent others from writing to it.
{quote}
I am not sure in which situation two jobs would publish to the same
changelog/checkpoint stream. From my understanding, each job is assigned its
own Kafka topic names.
Another correctness scenario I can imagine is users accidentally publishing
wrong messages to the checkpoint/changelog topic by some other means, such as
the command line or another streaming project. But this may be a separate topic?
{quote}
It would be awesome to have ...... Potentially could include additional
metadata about streams, e.g. owner, serialization format, schema, documentation
of semantics of the data, etc. (HCatalog for streams?)
{quote}
I always like a graph that shows something. A picture is worth a thousand
words. But I do not yet have an idea of how this should be implemented. Maybe
send the relevant information to another stream that feeds the dashboard?
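To make the "write lock" and job-graph ideas a bit more concrete, here is a
minimal sketch in Python. Everything in it is an assumption for the sake of
discussion: StreamRegistry, StreamLockedError and the method names are
hypothetical and are not part of Samza or Kafka, and a real version would need
shared, durable storage rather than an in-memory dict.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


class StreamLockedError(Exception):
    """Raised when a second job tries to become the writer of a locked stream."""


@dataclass
class StreamInfo:
    writer: Optional[str] = None               # job holding the "write lock", if any
    consumers: Set[str] = field(default_factory=set)


class StreamRegistry:
    """Hypothetical in-memory registry of stream producers and consumers."""

    def __init__(self) -> None:
        self._streams: Dict[str, StreamInfo] = {}

    def acquire_write_lock(self, stream: str, job: str) -> None:
        # Re-acquiring by the same job is allowed; any other job is rejected.
        info = self._streams.setdefault(stream, StreamInfo())
        if info.writer is not None and info.writer != job:
            raise StreamLockedError(
                f"{stream} is already written by {info.writer}"
            )
        info.writer = job

    def add_consumer(self, stream: str, job: str) -> None:
        self._streams.setdefault(stream, StreamInfo()).consumers.add(job)

    def edges(self) -> List[Tuple[str, str, str]]:
        """Return sorted (producer, stream, consumer) triples for a job graph."""
        return sorted(
            (info.writer, stream, consumer)
            for stream, info in self._streams.items()
            if info.writer is not None
            for consumer in info.consumers
        )
```

A dashboard could be driven from something like edges(), and a job would call
acquire_write_lock() for its checkpoint and changelog streams at startup, so a
second job configured with the same stream fails fast instead of silently
corrupting it.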
> Track producers and consumers of streams
> ----------------------------------------
>
> Key: SAMZA-300
> URL: https://issues.apache.org/jira/browse/SAMZA-300
> Project: Samza
> Issue Type: New Feature
> Reporter: Martin Kleppmann
>
> Each Samza job runs independently, which has a lot of advantages. However,
> there are situations in which it would be valuable to have a global overview
> of the data flows between jobs. For example:
> - It's important for correctness that only one job ever publishes to a given
> checkpoint or changelog stream — if several jobs publish to the same stream,
> the result is nonsensical. However, we currently have no way of enforcing
> that. It would be good if a job could take a "write lock" on a stream, and
> thus prevent others from writing to it.
> - It would be awesome to have a dashboard/visualization that graphically
> shows the job graph, and visually highlights the health of a job (e.g.
> whether a job has fallen behind).
> - The job graph would also be generally useful for tracking data provenance
> (finding consumers who would be affected by a schema change, finding the team
> that is responsible for producing a particular stream, etc)
> - Potentially could include additional metadata about streams, e.g. owner,
> serialization format, schema, documentation of semantics of the data, etc.
> (HCatalog for streams?)
> One possibility would be for Kafka to add some of this functionality,
> although it may also make sense to implement it in Samza (that way it would
> be available for non-Kafka systems as well, and could use knowledge about the
> job that Samza has, but Kafka hasn't).
> This is just a vague description to start a discussion. Please comment with
> your ideas on how to best implement this.