Hi, I was wondering what the best practice is for monitoring lag in Samza jobs. It looks like only the committed offsets are stored in the TaskInstanceMetrics, so it seems there are two options for computing lag.
a) Have a process consume the metrics topic. When it receives a message, it queries the Kafka brokers for the latest offset and computes lag. The problem is that if this process ever falls behind (because of maintenance or bugs), it will not handle old messages correctly, since the brokers' latest offsets will have moved on by then.

b) Poll metrics via JMX and compute lag at the same time. The problem is that the polling process has to know how to connect to every Samza container and have permission to do so. With this approach, we lose the advantage of the central metrics topic.

Any suggestions? Would it be too much overhead to fetch and store the latest offset for each SSP during checkpoint? I think this would allow a lag metric to be exposed for each SSP.

Thanks,
Roger
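
For reference, the lag computation discussed above amounts to subtracting each SSP's committed offset from the latest offset the broker reports for it. Below is a minimal Java sketch of that arithmetic only. The class name `LagSketch`, the method names, and the use of plain strings as SSP keys are assumptions made for illustration; no real Samza or Kafka API is called, and fetching the two offset maps is left out. It also assumes both offsets use the same convention (depending on whether the checkpoint holds the last processed or the next offset, the result may be off by one).

```java
import java.util.HashMap;
import java.util.Map;

public class LagSketch {
    // Lag for one SSP: distance between the latest broker offset and the
    // committed (checkpointed) offset. Clamped at zero because the two
    // readings may be taken at slightly different times.
    static long lag(long latestOffset, long committedOffset) {
        return Math.max(0L, latestOffset - committedOffset);
    }

    // Computes lag for every SSP that appears in both maps.
    // Keys are hypothetical SSP identifiers such as "kafka.my-topic.0".
    static Map<String, Long> lagPerSsp(Map<String, Long> committed,
                                       Map<String, Long> latest) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> entry : committed.entrySet()) {
            Long latestOffset = latest.get(entry.getKey());
            if (latestOffset != null) {
                result.put(entry.getKey(), lag(latestOffset, entry.getValue()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> committed = new HashMap<>();
        committed.put("kafka.my-topic.0", 100L);
        committed.put("kafka.my-topic.1", 250L);

        Map<String, Long> latest = new HashMap<>();
        latest.put("kafka.my-topic.0", 120L);
        latest.put("kafka.my-topic.1", 250L);

        // Prints one lag value per SSP.
        System.out.println(lagPerSsp(committed, latest));
    }
}
```

If the latest offset were captured at checkpoint time alongside the committed offset, both inputs to `lag` would come from the same moment, which is what makes the per-SSP metric meaningful even when the metrics consumer reads the message late.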
