Hi,

I was wonder what the best practice is for monitoring lag in Samza jobs.
It looks like only the commited offsets are stored in the
TaskInstanceMetrics so it seems there are two options for computing lag.

 a) have a process consume the metrics topic.  When it receives a message,
it would query Kafka brokers for latest offset and compute lag.  The
problem is that this process will not deal with old messages correctly (as
the brokers have moved on) if it even falls behind (because of maintenance
or bugs).

b) poll metrics via JMX and compute lag at the same time.  The problem is
that the polling process has to know how and have permission to connect to
all Samza containers.  With this approach, we loose the advantage of the
central metrics topic.

Any suggestions?  Would it be too much overhead to fetch and store the
latest offset for each SSP during checkpoint?  I think this would allow a
lag metric to be exposed for each SSP.

Thanks,

Roger

Reply via email to