More updates:

1. process-envelopes rate finally stabilized and converged. Consumer lag is down to zero.
2. avg choose-ns across containers dropped over time <https://www.dropbox.com/s/z4iiilvd7c1wrjc/Screenshot%202016-08-24%2010.46.22.png?dl=0>, which I assume is a good thing.
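For reference, the per-container numbers graphed above (process-envelopes, choose-ns, and the other timers discussed below) come from SamzaContainerMetrics. One way to get them into a dashboard like this is the snapshot metrics reporter, which publishes all container metrics to a Kafka stream; how these particular graphs were produced isn't stated in the thread. A minimal config sketch, assuming a Kafka system named "kafka" and a metrics topic named "metrics":

    # Ship container metrics to a Kafka "metrics" topic (and expose them via JMX).
    metrics.reporters=snapshot,jmx
    metrics.reporter.snapshot.class=org.apache.samza.metrics.reporter.MetricsSnapshotReporterFactory
    metrics.reporter.snapshot.stream=kafka.metrics
    metrics.reporter.jmx.class=org.apache.samza.metrics.reporter.JmxReporterFactory
    # Serde for the metrics snapshots written to the stream.
    serializers.registry.metrics.class=org.apache.samza.serializers.MetricsSnapshotSerdeFactory
    systems.kafka.streams.metrics.samza.msg.serde=metrics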
My question: Is choose-ns the total number of ms used to choose a message from the input stream? What are some gating factors (e.g. serialization?) for this metric?

Thanks,
David

On Wed, Aug 24, 2016 at 12:34 AM David Yu <david...@optimizely.com> wrote:

> Some metric updates:
> 1. We started seeing some containers with a higher choose-ns
> <https://www.dropbox.com/s/06n3awdwn8ntfxd/Screenshot%202016-08-24%2000.26.07.png?dl=0>.
> Not sure what would be the cause of this.
> 2. We are seeing very different process-envelopes values across containers
> <https://www.dropbox.com/s/n1wxtngquv607nb/Screenshot%202016-08-24%2000.21.05.png?dl=0>.
>
> On Tue, Aug 23, 2016 at 5:56 PM David Yu <david...@optimizely.com> wrote:
>
>> Hi, Jake,
>>
>> Thanks for your suggestions. Some of my answers inline:
>>
>> On Tue, Aug 23, 2016 at 11:53 AM Jacob Maes <jacob.m...@gmail.com> wrote:
>>
>>> Hey David,
>>>
>>> A few initial thoughts/questions:
>>>
>>> 1. Is this job using RocksDB to store the aggregations? If so, is it
>>> running on a machine with SSDs? We've seen a few performance issues
>>> related to RocksDB.
>>>    1. Not running on SSD causes slowness on disk
>>
>> - [David] This definitely pointed me in the right direction in my
>> investigation. We do use RocksDB, and we just realized that our YARN
>> cluster is using c3.xlarge EC2 instances with a mixture of EBS and
>> local SSD storage. After digging around, we noticed that some
>> containers were persisting their KV stores on SSD while others were
>> using EBS. We just updated our YARN config to use SSD only before
>> redeploying our jobs. So far everything looks good. Will report back
>> with a performance update.
>>
>>>    2. Prior to Samza 10.1, Samza would excessively flush the store to
>>>    disk, causing RocksDB compaction issues (stalls) - SAMZA-957
>>
>> - [David] We did notice that the log cleaner thread died on one of our
>> brokers. Not sure if this was the same problem you pointed out. The
>> following errors are logged:
>>
>> 2016-08-15 10:00:56,475 ERROR kafka.log.LogCleaner:
>> [kafka-log-cleaner-thread-0], Error due to
>> java.lang.IllegalArgumentException: requirement failed: 5865800 messages
>> in segment session-store-2.0-tickets-changelog-9/00000000000009548937.log
>> but offset map can fit only 5033164. You can increase
>> log.cleaner.dedupe.buffer.size or decrease log.cleaner.threads
>> at scala.Predef$.require(Predef.scala:219)
>>
>> We had to clean up the changelog topic and restart the broker to bring
>> back the cleaner thread.
>>
>>>    3. When the RocksDB store is used as a queue, the iterator can
>>>    suffer performance issues due to RocksDB's tombstoning.
>>>    (https://github.com/facebook/rocksdb/wiki/Implement-Queue-Service-Using-RocksDB)
>>
>> - [David] We use RocksDB to keep track of open sessions and use
>> sessionId (a random hash) as the key. In that sense, this does not
>> sound like a queue. But we do iterate over and delete closed sessions
>> during windowing on a minute-by-minute basis.
>>
>>> 2. Is the "messages-behind-high-watermark" metric non-zero?
>>
>> - [David] Yes.
>>
>>> 3. The SamzaContainerMetrics might be useful too. Particularly
>>> "choose-ns" and "commit-ns".
>>
>> - [David] We are seeing the following from one of the containers (after
>> the SSD fix mentioned above):
>> choose-ns=61353
>> commit-ns=306328 (what does this metric indicate? Is this in ms?)
>> process-ns=248260
>> window-ns=150717
>>
>>> 4. The only time I've personally seen slowness on the producer is if
>>> it's configured for acks="all". What is the producer config from the
>>> log?
>>
>> - [David] We did not override this. So it should be the default value (1?).
>>
>>> 5. The window time is high, but since it's only called once per
>>> minute, it looks like it only represents 1% of the event loop
>>> utilization. So I don't think that's a smoking gun.
>>>
>>> -Jake
>>>
>>> On Tue, Aug 23, 2016 at 9:18 AM, David Yu <david...@optimizely.com>
>>> wrote:
>>>
>>> > Dear Samza guys,
>>> >
>>> > We are here for some debugging suggestions on our Samza job (0.10.0),
>>> > which lags behind on consumption after running for a couple of
>>> > hours, regardless of the number of containers allocated (currently 5).
>>> >
>>> > Briefly, the job aggregates events into sessions (in Avro) during
>>> > process() and emits snapshots of the open sessions using window()
>>> > every minute. This graph
>>> > <https://www.dropbox.com/s/utywr1j5eku0ec0/Screenshot%202016-08-23%2010.33.16.png?dl=0>
>>> > shows where processing started to lag (red is the number of events
>>> > received and green is the number of events processed). The end
>>> > result is a steady increase of the consumer lag
>>> > <https://www.dropbox.com/s/fppsv91c339xmdb/Screenshot%202016-08-23%2010.19.27.png?dl=0>.
>>> > What we are trying to track down is where the performance bottleneck
>>> > is. But it's unclear at the moment whether that's in Samza or in Kafka.
>>> >
>>> > What we know so far:
>>> >
>>> > - The Kafka producer seems to take a while writing to the downstream
>>> > topics (changelog and session snapshots), as shown by various
>>> > timers. Not sure which numbers are critical, but here are the
>>> > producer metrics
>>> > <https://www.dropbox.com/s/pzi9304gw5vmae2/Screenshot%202016-08-23%2010.57.33.png?dl=0>
>>> > from one container.
>>> > - avg windowing duration peaks at one point during the day (due to
>>> > the number of open sessions) but everything is still sub-second
>>> > <https://www.dropbox.com/s/y2ps6pbs1tf257e/Screenshot%202016-08-23%2010.44.19.png?dl=0>.
>>> > - our Kafka cluster doesn't seem to be overloaded
>>> > <https://www.dropbox.com/s/q01b4p4rg43spua/Screenshot%202016-08-23%2010.48.25.png?dl=0>,
>>> > with writes < 60MB/s across all three brokers.
>>> >
>>> > From all we know, we suspect that the bottleneck is in producing to
>>> > Kafka. But we need some help confirming that.
>>> >
>>> > Any suggestion is appreciated.
>>> >
>>> > David
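For readers following the RocksDB discussion in the quoted thread: the access pattern described (sessions keyed by a random sessionId, with window() emitting snapshots of open sessions and deleting closed ones every minute) looks roughly like the sketch below. This is an illustrative sketch only, not the actual Optimizely job; the class name, store name, output stream, and the Event/Session types are assumptions. The delete() calls in window() are what produce the RocksDB tombstones that Jake's point 3 refers to, and the next iteration has to skip over them until compaction removes them.

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.Entry;
    import org.apache.samza.storage.kv.KeyValueIterator;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;
    import org.apache.samza.task.WindowableTask;

    // Illustrative only: class, store, stream, and Event/Session types are assumptions.
    public class SessionAggregatorTask implements StreamTask, WindowableTask, InitableTask {
      private static final SystemStream SNAPSHOTS = new SystemStream("kafka", "session-snapshots");
      private KeyValueStore<String, Session> store; // RocksDB-backed via stores.*.factory

      @Override
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        store = (KeyValueStore<String, Session>) context.getStore("session-store");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                          TaskCoordinator coordinator) {
        Event event = (Event) envelope.getMessage();   // deserialized by the configured serde
        Session session = store.get(event.sessionId);  // keyed by a random session hash, not queue-like
        if (session == null) {
          session = new Session(event.sessionId);
        }
        session.update(event);
        store.put(event.sessionId, session);           // also appended to the store's changelog topic
      }

      @Override
      public void window(MessageCollector collector, TaskCoordinator coordinator) {
        // Runs every task.window.ms (one minute here): snapshot open sessions and
        // delete closed ones. Each delete() writes a RocksDB tombstone.
        KeyValueIterator<String, Session> it = store.all();
        try {
          while (it.hasNext()) {
            Entry<String, Session> entry = it.next();
            collector.send(new OutgoingMessageEnvelope(SNAPSHOTS, entry.getKey(), entry.getValue()));
            if (entry.getValue().closed) {
              store.delete(entry.getKey());
            }
          }
        } finally {
          it.close(); // always close RocksDB-backed iterators
        }
      }

      // Placeholder application types (the real job uses Avro records).
      public static class Event {
        final String sessionId;
        Event(String sessionId) { this.sessionId = sessionId; }
      }

      public static class Session {
        final String id;
        int eventCount;
        boolean closed;
        Session(String id) { this.id = id; }
        void update(Event e) { eventCount++; }
      }
    }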
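The configuration knobs touched on in the thread (the one-minute window, the RocksDB store and its changelog, producer acks from Jake's point 4, and the broker-side log-cleaner settings named in the error message) live in the Samza job config and the Kafka broker config respectively. A hedged sketch with illustrative store/topic names and values, not the job's real settings:

    # Samza job config (illustrative names)
    task.window.ms=60000          # call window() once per minute
    stores.session-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
    stores.session-store.changelog=kafka.session-store-changelog
    systems.kafka.producer.acks=1 # left at the producer default here; acks=all is the slow case in point 4

    # Kafka broker config (server.properties), relevant to the log-cleaner error quoted above
    log.cleaner.dedupe.buffer.size=268435456  # e.g. 256 MB, so the offset map can fit more messages per segment
    log.cleaner.threads=1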