Dear Samza guys,

We are looking for debugging suggestions for our Samza job (0.10.0), which
falls behind on consumption after running for a couple of hours, regardless
of the number of containers allocated (currently 5).

Briefly, the job aggregates events into sessions (in Avro) during process()
and emits snapshots of the open sessions using window() every minute. This
graph
<https://www.dropbox.com/s/utywr1j5eku0ec0/Screenshot%202016-08-23%2010.33.16.png?dl=0>
shows where processing started to lag (red is the number of events received
and green is the number of events processed). The end result is a steady
increase in consumer lag
<https://www.dropbox.com/s/fppsv91c339xmdb/Screenshot%202016-08-23%2010.19.27.png?dl=0>.
What we are trying to track down is where the performance bottleneck is, but
it is unclear at the moment whether it is in Samza or in Kafka.
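
To make the process()/window() split concrete, here is a minimal,
dependency-free sketch of that pattern. It is not the actual job code and
does not use the Samza API: the class name, the timeout-based session expiry,
and the string snapshots are all assumptions made for illustration. In the
real job, process() would also write to the changelog-backed store and
window() would send the snapshots to the downstream topic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the job's task; it does not use the Samza API.
class SessionTaskSketch {
    // Hypothetical session record: key, event count, last event timestamp.
    static final class Session {
        final String key;
        int eventCount;
        long lastEventMs;

        Session(String key) {
            this.key = key;
        }
    }

    private final Map<String, Session> openSessions = new HashMap<>();
    private final long sessionTimeoutMs; // assumed expiry rule, not from the email

    SessionTaskSketch(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    // Called once per incoming event, like process():
    // fold the event into its open session, creating the session if needed.
    void process(String sessionKey, long eventTimeMs) {
        Session s = openSessions.computeIfAbsent(sessionKey, Session::new);
        s.eventCount++;
        s.lastEventMs = Math.max(s.lastEventMs, eventTimeMs);
    }

    // Called on a timer, like window():
    // drop sessions idle for longer than the timeout, then return one
    // snapshot ("key:eventCount") per session that is still open.
    List<String> window(long nowMs) {
        openSessions.values().removeIf(s -> nowMs - s.lastEventMs > sessionTimeoutMs);
        List<String> snapshots = new ArrayList<>();
        for (Session s : openSessions.values()) {
            snapshots.add(s.key + ":" + s.eventCount);
        }
        return snapshots;
    }
}
```

The point of the sketch is only that the work done in window() grows with the
number of open sessions, while process() does a constant amount of work per
event.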

What we know so far:

   - The Kafka producer seems to take a while writing to the downstream
   topics (changelog and session snapshots), as shown by various timers.
   We are not sure which numbers are critical, but here are the producer
   metrics from one container:
   <https://www.dropbox.com/s/pzi9304gw5vmae2/Screenshot%202016-08-23%2010.57.33.png?dl=0>
   - The average windowing duration peaks at one point during the day (due
   to the number of open sessions), but it always stays under a second:
   <https://www.dropbox.com/s/y2ps6pbs1tf257e/Screenshot%202016-08-23%2010.44.19.png?dl=0>
   - Our Kafka cluster doesn't seem to be overloaded, with writes < 60MB/s
   across all three brokers:
   <https://www.dropbox.com/s/q01b4p4rg43spua/Screenshot%202016-08-23%2010.48.25.png?dl=0>

From all we know, we suspect that the bottleneck is in producing to Kafka,
but we need some help confirming that.

Any suggestions are appreciated.

David
