[jira] [Commented] (SAMZA-300) Track producers and consumers of streams

Chris Riccomini (JIRA) Wed, 09 Jul 2014 16:05:03 -0700

    [ 
https://issues.apache.org/jira/browse/SAMZA-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056880#comment-14056880
 ]


Chris Riccomini commented on SAMZA-300:
---------------------------------------

My half-baked thoughts:

At a high level, this ticket touches on two main themes: job 
deployment/verification, and visualization. I've always thought that these two 
issues should be solved by providing a deployment/metrics/monitoring/etc 
dashboard as an independent project for Samza, like Oozie or Azkaban for Hadoop. 
In Samza's case, some of the metrics should go into the job's AM (the YARN 
ApplicationMaster), but other pieces (e.g. job flow visualization and 
deployment) would exist as a standalone dashboard.

bq. It's important for correctness that only one job ever publishes to a given 
checkpoint or changelog stream — if several jobs publish to the same stream, 
the result is nonsensical. However, we currently have no way of enforcing that. 
It would be good if a job could take a "write lock" on a stream, and thus 
prevent others from writing to it.

This sounds like a useful Kafka feature. Acquiring a lock on a topic/partition 
from the broker seems like a great way to guarantee single-writer. It's also 
somewhat related to Kafka's generation-id/transactionality discussion (bump the 
generation ID every time you need to lock everyone else out). See 
https://cwiki.apache.org/confluence/display/KAFKA/Transactional+Messaging+in+Kafka
 for a big discussion on all of this.
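
To make the generation-ID idea a bit more concrete, here is a rough in-memory 
sketch in Python. It is not a Kafka API: FencedStream, acquire, append and 
StaleWriterError are hypothetical names made up for illustration, and a real 
broker would have to persist the generation and track it per topic/partition.

```python
class StaleWriterError(Exception):
    """Raised when a writer holding an outdated generation tries to append."""


class FencedStream:
    """Hypothetical single-writer stream guarded by a generation ID."""

    def __init__(self):
        self.generation = 0   # current (latest) generation handed out
        self.messages = []    # accepted messages, in append order

    def acquire(self):
        # Bumping the generation locks out every previously issued writer.
        self.generation += 1
        return self.generation

    def append(self, generation, message):
        # Only the holder of the latest generation may write.
        if generation != self.generation:
            raise StaleWriterError(
                "generation %d is stale (current is %d)"
                % (generation, self.generation)
            )
        self.messages.append(message)


if __name__ == "__main__":
    stream = FencedStream()
    job_a = stream.acquire()
    stream.append(job_a, "checkpoint-1")

    job_b = stream.acquire()          # job_b takes over; job_a is now fenced
    stream.append(job_b, "checkpoint-2")

    try:
        stream.append(job_a, "checkpoint-3")
    except StaleWriterError as err:
        print("rejected:", err)
```

The point of the sketch is only that a stale writer is rejected at append 
time, so two jobs cannot interleave writes to the same stream.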

bq. It would be awesome to have a dashboard/visualization that graphically 
shows the job graph, and visually highlights the health of a job (e.g. whether 
a job is fallen behind).

We have achieved this in the past simply by using the MetricsSnapshotReporter, 
and then consuming metrics from all of the jobs. This allows you to stitch 
together who is reading and writing to streams. From there, you can draw a full 
data flow of all your Samza jobs, along with messages/sec, lag, etc.
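
As a rough illustration of the stitching step, here is a small Python sketch. 
It assumes the metrics have already been reduced to (job, stream, direction) 
tuples; that record shape, and the names build_job_graph and 
multi_writer_streams, are hypothetical and not the MetricsSnapshotReporter 
format.

```python
from collections import defaultdict


def index_by_stream(records):
    """Group job names by stream, split into producers and consumers."""
    producers = defaultdict(set)
    consumers = defaultdict(set)
    for job, stream, direction in records:
        if direction == "write":
            producers[stream].add(job)
        elif direction == "read":
            consumers[stream].add(job)
        else:
            raise ValueError("unknown direction: %r" % (direction,))
    return producers, consumers


def build_job_graph(records):
    """Return sorted (producer_job, stream, consumer_job) edges."""
    producers, consumers = index_by_stream(records)
    edges = set()
    for stream, writers in producers.items():
        for writer in writers:
            # .get avoids creating empty entries for streams nobody reads
            for reader in consumers.get(stream, ()):
                edges.add((writer, stream, reader))
    return sorted(edges)


def multi_writer_streams(records):
    """Return streams written by more than one job, with their writers."""
    producers, _ = index_by_stream(records)
    return {
        stream: sorted(jobs)
        for stream, jobs in producers.items()
        if len(jobs) > 1
    }


if __name__ == "__main__":
    # Hypothetical, pre-extracted records.
    records = [
        ("job-a", "page-views", "write"),
        ("job-b", "page-views", "read"),
        ("job-b", "sessions", "write"),
        ("job-c", "sessions", "read"),
    ]
    for edge in build_job_graph(records):
        print(edge)
```

The same index can also flag streams with more than one writer, which is the 
situation the write-lock idea above is meant to prevent.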

bq. Potentially could include additional metadata about streams, e.g. owner, 
serialization format, schema, documentation of semantics of the data, etc. 
(HCatalog for streams?)

Seems somewhat related to your Avro serde/schema registry ticket (SAMZA-317).

> Track producers and consumers of streams
> ----------------------------------------
>
>                 Key: SAMZA-300
>                 URL: https://issues.apache.org/jira/browse/SAMZA-300
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Martin Kleppmann
>
> Each Samza job runs independently, which has a lot of advantages. However, 
> there are situations in which it would be valuable to have a global overview 
> of the data flows between jobs. For example:
> - It's important for correctness that only one job ever publishes to a given 
> checkpoint or changelog stream — if several jobs publish to the same stream, 
> the result is nonsensical. However, we currently have no way of enforcing 
> that. It would be good if a job could take a "write lock" on a stream, and 
> thus prevent others from writing to it.
> - It would be awesome to have a dashboard/visualization that graphically 
> shows the job graph, and visually highlights the health of a job (e.g. 
> whether a job is fallen behind).
> - The job graph would also be generally useful for tracking data provenance 
> (finding consumers who would be affected by a schema change, finding the team 
> that is responsible for producing a particular stream, etc)
> - Potentially could include additional metadata about streams, e.g. owner, 
> serialization format, schema, documentation of semantics of the data, etc. 
> (HCatalog for streams?)
> One possibility would be for Kafka to add some of this functionality, 
> although it may also make sense to implement it in Samza (that way it would 
> be available for non-Kafka systems as well, and could use knowledge about the 
> job that Samza has, but Kafka hasn't).
> This is just a vague description to start a discussion. Please comment with 
> your ideas on how to best implement this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
