More updates:

1. process-envelopes rate finally stabilized and converged. Consumer lag is down to zero.
2. avg choose-ns across containers dropped over time <https://www.dropbox.com/s/z4iiilvd7c1wrjc/Screenshot%202016-08-24%2010.46.22.png?dl=0>, which I assume is a good thing.
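For reference, the per-container numbers graphed above (process-envelopes, choose-ns, and the other timers discussed below) come from SamzaContainerMetrics. One way to get them into a dashboard like this is the snapshot metrics reporter, which publishes all container metrics to a Kafka stream; how these particular graphs were produced isn't stated in the thread. A minimal config sketch, assuming a Kafka system named "kafka" and a metrics topic named "metrics":

    # Ship container metrics to a Kafka "metrics" topic (and expose them via JMX).
    metrics.reporters=snapshot,jmx
    metrics.reporter.snapshot.class=org.apache.samza.metrics.reporter.MetricsSnapshotReporterFactory
    metrics.reporter.snapshot.stream=kafka.metrics
    metrics.reporter.jmx.class=org.apache.samza.metrics.reporter.JmxReporterFactory
    # Serde for the metrics snapshots written to the stream.
    serializers.registry.metrics.class=org.apache.samza.serializers.MetricsSnapshotSerdeFactory
    systems.kafka.streams.metrics.samza.msg.serde=metrics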
My question: Is choose-ns the total number of ms used to choose a message from the input stream? What are some gating factors (e.g. serialization?) for this metric?

Thanks,
David

On Wed, Aug 24, 2016 at 12:34 AM David Yu <david...@optimizely.com> wrote:

> Some metric updates:
> 1. We started seeing some containers with a higher choose-ns
> <https://www.dropbox.com/s/06n3awdwn8ntfxd/Screenshot%202016-08-24%2000.26.07.png?dl=0>.
> Not sure what would be the cause of this.
> 2. We are seeing very different process-envelopes values across containers
> <https://www.dropbox.com/s/n1wxtngquv607nb/Screenshot%202016-08-24%2000.21.05.png?dl=0>.
>
> On Tue, Aug 23, 2016 at 5:56 PM David Yu <david...@optimizely.com> wrote:
>
>> Hi, Jake,
>>
>> Thanks for your suggestions. Some of my answers inline:
>>
>> On Tue, Aug 23, 2016 at 11:53 AM Jacob Maes <jacob.m...@gmail.com> wrote:
>>
>>> Hey David,
>>>
>>> A few initial thoughts/questions:
>>>
>>> 1. Is this job using RocksDB to store the aggregations? If so, is it
>>> running on a machine with SSDs? We've seen a few performance issues
>>> related to RocksDB.
>>>    1. Not running on SSD causes slowness on disk
>>
>> - [David] This definitely pointed me in the right direction in my
>> investigation. We do use RocksDB, and we just realized that our YARN
>> cluster is using c3.xlarge EC2 instances with a mixture of EBS and
>> local SSD storage. After digging around, we noticed that some
>> containers were persisting their KV stores on SSD while others were
>> using EBS. We just updated our YARN config to use SSD only before
>> redeploying our jobs. So far everything looks good. Will report back
>> with a performance update.
>>
>>>    2. Prior to Samza 10.1, Samza would excessively flush the store to
>>>    disk, causing RocksDB compaction issues (stalls) - SAMZA-957
>>
>> - [David] We did notice that the log cleaner thread died on one of our
>> brokers. Not sure if this was the same problem you pointed out. The
>> following errors are logged:
>>
>> 2016-08-15 10:00:56,475 ERROR kafka.log.LogCleaner:
>> [kafka-log-cleaner-thread-0], Error due to
>> java.lang.IllegalArgumentException: requirement failed: 5865800 messages
>> in segment session-store-2.0-tickets-changelog-9/00000000000009548937.log
>> but offset map can fit only 5033164. You can increase
>> log.cleaner.dedupe.buffer.size or decrease log.cleaner.threads
>> at scala.Predef$.require(Predef.scala:219)
>>
>> We had to clean up the changelog topic and restart the broker to bring
>> back the cleaner thread.
>>
>>>    3. When the RocksDB store is used as a queue, the iterator can
>>>    suffer performance issues due to RocksDB's tombstoning.
>>>    (https://github.com/facebook/rocksdb/wiki/Implement-Queue-Service-Using-RocksDB)
>>
>> - [David] We use RocksDB to keep track of open sessions and use
>> sessionId (a random hash) as the key. In that sense, this does not
>> sound like a queue. But we do iterate over and delete closed sessions
>> during windowing on a minute-by-minute basis.
>>
>>> 2. Is the "messages-behind-high-watermark" metric non-zero?
>>
>> - [David] Yes.
>>
>>> 3. The SamzaContainerMetrics might be useful too. Particularly
>>> "choose-ns" and "commit-ns".
>>
>> - [David] We are seeing the following from one of the containers (after
>> the SSD fix mentioned above):
>> choose-ns=61353
>> commit-ns=306328 (what does this metric indicate? Is this in ms?)
>> process-ns=248260
>> window-ns=150717
>>
>>> 4. The only time I've personally seen slowness on the producer is if
>>> it's configured for acks="all". What is the producer config from the
>>> log?
>>
>> - [David] We did not override this. So it should be the default value (1?).
>>
>>> 5. The window time is high, but since it's only called once per
>>> minute, it looks like it only represents 1% of the event loop
>>> utilization. So I don't think that's a smoking gun.
>>>
>>> -Jake
>>>
>>> On Tue, Aug 23, 2016 at 9:18 AM, David Yu <david...@optimizely.com>
>>> wrote:
>>>
>>> > Dear Samza guys,
>>> >
>>> > We are here for some debugging suggestions on our Samza job (0.10.0),
>>> > which lags behind on consumption after running for a couple of
>>> > hours, regardless of the number of containers allocated (currently 5).
>>> >
>>> > Briefly, the job aggregates events into sessions (in Avro) during
>>> > process() and emits snapshots of the open sessions using window()
>>> > every minute. This graph
>>> > <https://www.dropbox.com/s/utywr1j5eku0ec0/Screenshot%202016-08-23%2010.33.16.png?dl=0>
>>> > shows where processing started to lag (red is the number of events
>>> > received and green is the number of events processed). The end
>>> > result is a steady increase of the consumer lag
>>> > <https://www.dropbox.com/s/fppsv91c339xmdb/Screenshot%202016-08-23%2010.19.27.png?dl=0>.
>>> > What we are trying to track down is where the performance bottleneck
>>> > is. But it's unclear at the moment whether that's in Samza or in Kafka.
>>> >
>>> > What we know so far:
>>> >
>>> > - The Kafka producer seems to take a while writing to the downstream
>>> > topics (changelog and session snapshots), as shown by various
>>> > timers. Not sure which numbers are critical, but here are the
>>> > producer metrics
>>> > <https://www.dropbox.com/s/pzi9304gw5vmae2/Screenshot%202016-08-23%2010.57.33.png?dl=0>
>>> > from one container.
>>> > - avg windowing duration peaks at one point during the day (due to
>>> > the number of open sessions) but everything is still sub-second
>>> > <https://www.dropbox.com/s/y2ps6pbs1tf257e/Screenshot%202016-08-23%2010.44.19.png?dl=0>.
>>> > - our Kafka cluster doesn't seem to be overloaded
>>> > <https://www.dropbox.com/s/q01b4p4rg43spua/Screenshot%202016-08-23%2010.48.25.png?dl=0>,
>>> > with writes < 60MB/s across all three brokers.
>>> >
>>> > From all we know, we suspect that the bottleneck is in producing to
>>> > Kafka. But we need some help confirming that.
>>> >
>>> > Any suggestion is appreciated.
>>> >
>>> > David
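For readers following the RocksDB discussion in the quoted thread: the access pattern described (sessions keyed by a random sessionId, with window() emitting snapshots of open sessions and deleting closed ones every minute) looks roughly like the sketch below. This is an illustrative sketch only, not the actual Optimizely job; the class name, store name, output stream, and the Event/Session types are assumptions. The delete() calls in window() are what produce the RocksDB tombstones that Jake's point 3 refers to, and the next iteration has to skip over them until compaction removes them.

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.Entry;
    import org.apache.samza.storage.kv.KeyValueIterator;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;
    import org.apache.samza.task.WindowableTask;

    // Illustrative only: class, store, stream, and Event/Session types are assumptions.
    public class SessionAggregatorTask implements StreamTask, WindowableTask, InitableTask {
      private static final SystemStream SNAPSHOTS = new SystemStream("kafka", "session-snapshots");
      private KeyValueStore<String, Session> store; // RocksDB-backed via stores.*.factory

      @Override
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        store = (KeyValueStore<String, Session>) context.getStore("session-store");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                          TaskCoordinator coordinator) {
        Event event = (Event) envelope.getMessage();   // deserialized by the configured serde
        Session session = store.get(event.sessionId);  // keyed by a random session hash, not queue-like
        if (session == null) {
          session = new Session(event.sessionId);
        }
        session.update(event);
        store.put(event.sessionId, session);           // also appended to the store's changelog topic
      }

      @Override
      public void window(MessageCollector collector, TaskCoordinator coordinator) {
        // Runs every task.window.ms (one minute here): snapshot open sessions and
        // delete closed ones. Each delete() writes a RocksDB tombstone.
        KeyValueIterator<String, Session> it = store.all();
        try {
          while (it.hasNext()) {
            Entry<String, Session> entry = it.next();
            collector.send(new OutgoingMessageEnvelope(SNAPSHOTS, entry.getKey(), entry.getValue()));
            if (entry.getValue().closed) {
              store.delete(entry.getKey());
            }
          }
        } finally {
          it.close(); // always close RocksDB-backed iterators
        }
      }

      // Placeholder application types (the real job uses Avro records).
      public static class Event {
        final String sessionId;
        Event(String sessionId) { this.sessionId = sessionId; }
      }

      public static class Session {
        final String id;
        int eventCount;
        boolean closed;
        Session(String id) { this.id = id; }
        void update(Event e) { eventCount++; }
      }
    }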
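The configuration knobs touched on in the thread (the one-minute window, the RocksDB store and its changelog, producer acks from Jake's point 4, and the broker-side log-cleaner settings named in the error message) live in the Samza job config and the Kafka broker config respectively. A hedged sketch with illustrative store/topic names and values, not the job's real settings:

    # Samza job config (illustrative names)
    task.window.ms=60000          # call window() once per minute
    stores.session-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
    stores.session-store.changelog=kafka.session-store-changelog
    systems.kafka.producer.acks=1 # left at the producer default here; acks=all is the slow case in point 4

    # Kafka broker config (server.properties), relevant to the log-cleaner error quoted above
    log.cleaner.dedupe.buffer.size=268435456  # e.g. 256 MB, so the offset map can fit more messages per segment
    log.cleaner.threads=1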