Dear Samza guys,

We are looking for some debugging suggestions for our Samza job (0.10.0), which falls behind on consumption after running for a couple of hours, regardless of the number of containers allocated (currently 5).
Briefly, the job aggregates events into sessions (in Avro) during process() and emits snapshots of the open sessions using window() every minute.

This graph <https://www.dropbox.com/s/utywr1j5eku0ec0/Screenshot%202016-08-23%2010.33.16.png?dl=0> shows where processing started to lag (red is the number of events received and green is the number of events processed). The end result is a steady increase in consumer lag <https://www.dropbox.com/s/fppsv91c339xmdb/Screenshot%202016-08-23%2010.19.27.png?dl=0>.

What we are trying to track down is where the performance bottleneck is, but it's unclear at the moment whether it is in Samza or in Kafka.

What we know so far:

- The Kafka producer seems to take a while writing to the downstream topics (changelog and session snapshots), as shown by various timers. We are not sure which numbers are critical, but here are the producer metrics <https://www.dropbox.com/s/pzi9304gw5vmae2/Screenshot%202016-08-23%2010.57.33.png?dl=0> from one container.
- The average windowing duration peaks at one point during the day (due to the number of open sessions), but everything is still sub-second <https://www.dropbox.com/s/y2ps6pbs1tf257e/Screenshot%202016-08-23%2010.44.19.png?dl=0>.
- Our Kafka cluster doesn't seem to be overloaded <https://www.dropbox.com/s/q01b4p4rg43spua/Screenshot%202016-08-23%2010.48.25.png?dl=0>, with writes < 60MB/s across all three brokers.

From all we know, we suspect that the bottleneck is in producing to Kafka, but we need some help confirming that.

Any suggestion is appreciated.

David