Makes sense. Thanks for the help, Jake!

On Wed, Aug 24, 2016 at 5:11 PM Jacob Maes <jacob.m...@gmail.com> wrote:
We don't have any hard guidelines around that metric, just because there are no hard rules that work for every job. For example, some jobs are very bursty and need to keep up with huge traffic ramp-ups even though they're underutilized the rest of the time.

That said, yes, I have used that metric to determine whether a job has too much parallelism. But it was a job that had very stable throughput patterns and didn't have any major time spent in the window or commit methods, which could cause periodic spikes in utilization.

On Wed, Aug 24, 2016 at 2:55 PM, David Yu <david...@optimizely.com> wrote:

Interesting.

To me, "event-loop-utilization" looks like a good indicator that shows us how busy the containers are. Is it safe to use this metric as a reference when we need to scale our job out or in? For example, if I'm seeing around 0.3 utilization most of the time, maybe I can decrease the # of containers and save some resources?

Thanks,
David

On Wed, Aug 24, 2016 at 1:27 PM Jacob Maes <jacob.m...@gmail.com> wrote:

> Based on what you have described, the following should be true in 0.10.1:
> event-loop-ns = choose-ns + process-ns + window-ns (if necessary) + commit-ns (if necessary)

Yes, plus any time (e.g. due to an unlucky GC at just the right moment) that happens outside those timers. And no "if necessary" for window or commit. There will be a small value for those methods even if they don't do anything significant, because the timer runs even for no-ops.

Since you're on 10.1, there's another useful metric, "event-loop-utilization", which represents (process-ns + window-ns + commit-ns) / event-loop-ns (as you defined it). In other words, the proportion of time spent working vs. waiting.

On Wed, Aug 24, 2016 at 10:18 AM, David Yu <david...@optimizely.com> wrote:

Great. It all makes sense now.

With the SSD fix, we also upgraded to 0.10.1. So we should see pretty consistent process-ns (which we do).

Based on what you have described, the following should be true in 0.10.1:
event-loop-ns = choose-ns + process-ns + window-ns (if necessary) + commit-ns (if necessary)

Is this correct?

Thanks,
David

On Wed, Aug 24, 2016 at 11:27 AM Jacob Maes <jacob.m...@gmail.com> wrote:

A couple other notes.

Prior to Samza 10.1, choose-ns was part of process-ns. So when choose-ns and process-ns are both high (around 10,000,000 == 10 ms, which is the default poll timeout), that usually means the task is caught up. In Samza 10.1 the same is true if ONLY choose-ns is high. process-ns is always the time spent in the process() method.

Based on the above, the metric numbers you provided after the SSD fix all look reasonable. They're all sub-millisecond, and since choose-ns and process-ns are low, it seems that the container is chewing through messages at a good rate.

So I would conclude that the SSD fix was probably the right one and it just took the job a while to catch up to the backlog of messages. Once it caught up, the choose-ns and process-ns increased, which is normal when the processor is waiting for new messages.

-Jake
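A minimal sketch of that utilization arithmetic, plugging in the timer values reported further down the thread (all values are in nanoseconds; treating the sum of the four timers as event-loop-ns slightly understates it, since time outside the timers, such as GC, is not counted):

    // Illustrative arithmetic only; class and variable names are made up.
    public class UtilizationExample {
        public static void main(String[] args) {
            double chooseNs = 61353, processNs = 248260, windowNs = 150717, commitNs = 306328;
            double eventLoopNs = chooseNs + processNs + windowNs + commitNs; // lower bound on event-loop-ns
            double utilization = (processNs + windowNs + commitNs) / eventLoopNs;
            System.out.printf("event-loop-utilization ~= %.2f%n", utilization); // ~0.92 for these values
        }
    }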
On Wed, Aug 24, 2016 at 9:05 AM, Jacob Maes <jacob.m...@gmail.com> wrote:

Hey David,

Answering the most recent question first, since it's also the easiest. :-)

> Is choose-ns the total number of ms used to choose a message from the input stream? What are some gating factors (e.g. serialization?) for this metric?

It's the amount of time the event loop spent getting new messages for process(). It includes deserialization time and poll time, which we added new metrics for in Samza 10.1. Typically deserialization time is pretty consistent, so when you see a spike in choose-ns, it's usually because the event loop is waiting for new messages. The two most common cases when it's waiting are:
1. There are no new messages in the topic partition. This is good because it means the processor is caught up.
2. The consumer is slow and/or the buffer isn't large enough, so the BrokerProxy isn't able to keep enough messages buffered to keep the event loop busy. This is uncommon because the buffer defaults to 50,000 messages, which should be plenty. But if it happens, it's bad. To control this behavior, see the following properties in the config table (http://samza.apache.org/learn/documentation/0.10/jobs/configuration-table.html):
systems.system-name.samza.fetch.threshold
task.poll.interval.ms

On Wed, Aug 24, 2016 at 8:52 AM, David Yu <david...@optimizely.com> wrote:

More updates:
1. The process-envelopes rate finally stabilized and converged. Consumer lag is down to zero.
2. Avg choose-ns across containers dropped over time
<https://www.dropbox.com/s/z4iiilvd7c1wrjc/Screenshot%202016-08-24%2010.46.22.png?dl=0>,
which I assume is a good thing.

My question:
Is choose-ns the total number of ms used to choose a message from the input stream? What are some gating factors (e.g. serialization?) for this metric?

Thanks,
David

On Wed, Aug 24, 2016 at 12:34 AM David Yu <david...@optimizely.com> wrote:

Some metric updates:
1. We started seeing some containers with a higher choose-ns
<https://www.dropbox.com/s/06n3awdwn8ntfxd/Screenshot%202016-08-24%2000.26.07.png?dl=0>.
Not sure what would be the cause of this.
2. We are seeing very different process-envelopes values across containers
<https://www.dropbox.com/s/n1wxtngquv607nb/Screenshot%202016-08-24%2000.21.05.png?dl=0>.
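For reference, the two consumer-buffer properties Jake points to above would sit in the job's properties file along these lines; the system name "kafka" and the values are placeholders rather than settings taken from this job (the thread only states the 50,000-message default buffer and the 10 ms default poll timeout):

    # illustrative values only -- tune per job
    systems.kafka.samza.fetch.threshold=50000
    task.poll.interval.ms=10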
On Tue, Aug 23, 2016 at 5:56 PM David Yu <david...@optimizely.com> wrote:

Hi, Jake,

Thanks for your suggestions. Some of my answers inline:

On Tue, Aug 23, 2016 at 11:53 AM Jacob Maes <jacob.m...@gmail.com> wrote:

> Hey David,
>
> A few initial thoughts/questions:
>
> 1. Is this job using RocksDB to store the aggregations? If so, is it running on a machine with SSDs? We've seen a few performance issues related to RocksDB.
>    1. Not running on SSD causes slowness on disk

- [David] This definitely pointed me in the right direction in my investigation. We do use RocksDB, and we just realized that our YARN cluster is using c3.xlarge EC2 instances with a mixture of EBS and local SSD storage. After digging around, we noticed that some containers were persisting their KV stores on SSD while others were using EBS. We just updated our YARN config to use SSD only before redeploying our jobs. So far everything looks good. Will report back on the performance update.

>    2. Prior to Samza 10.1, Samza would excessively flush the store to disk, causing RocksDB compaction issues (stalls) - SAMZA-957

- [David] We did notice that the log cleaner thread died on one of our brokers. Not sure if this was the same problem you pointed out. The following errors are logged:

    2016-08-15 10:00:56,475 ERROR kafka.log.LogCleaner: [kafka-log-cleaner-thread-0], Error due to
    java.lang.IllegalArgumentException: requirement failed: 5865800 messages in segment
    session-store-2.0-tickets-changelog-9/00000000000009548937.log but offset map can fit only 5033164.
    You can increase log.cleaner.dedupe.buffer.size or decrease log.cleaner.threads
        at scala.Predef$.require(Predef.scala:219)

We had to clean up the changelog topic and restart the broker to bring back the cleaner thread.

>    3. When the RocksDB store is used as a queue, the iterator can suffer performance issues due to RocksDB's tombstoning. (https://github.com/facebook/rocksdb/wiki/Implement-Queue-Service-Using-RocksDB)

- [David] We use RocksDB to keep track of open sessions and use sessionId (a random hash) as the key. In that sense, this does not sound like a queue. But we do iterate and delete closed sessions during windowing on a minute-by-minute basis.

> 2. Is the "messages-behind-high-watermark" metric non-zero?

- [David] Yes.

> 3. The SamzaContainerMetrics might be useful too. Particularly "choose-ns" and "commit-ns"

- [David] We are seeing the following from one of the containers (after the SSD fix mentioned above):
  choose-ns=61353
  commit-ns=306328 (what does this metric indicate? Is this in ms?)
  process-ns=248260
  window-ns=150717

> 4. The only time I've personally seen slowness on the producer is if it's configured for acks="all". What is the producer config from the log?

- [David] We did not override this. So it should be the default value (1?).

> 5. The window time is high, but since it's only called once per minute, it looks like it only represents 1% of the event loop utilization. So I don't think that's a smoking gun.
>
> -Jake
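A minimal sketch of the iterate-and-delete pattern David describes under point 3 above; this is not the actual job, and the store name, value type, and close condition are placeholders. It shows the usual shape of a Samza 0.10 task that keeps sessions in a RocksDB-backed KeyValueStore, scans them in window(), and deletes the closed ones. Each delete writes a RocksDB tombstone, which is what the queue-pattern caveat refers to; a full scan plus deletes once per minute keeps that bounded.

    // Sketch only: "session-store", the byte[] value type, and isClosed() are
    // placeholders, not taken from the original job.
    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.Entry;
    import org.apache.samza.storage.kv.KeyValueIterator;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;
    import org.apache.samza.task.WindowableTask;

    import java.util.ArrayList;
    import java.util.List;

    public class SessionWindowTask implements StreamTask, WindowableTask, InitableTask {
      private KeyValueStore<String, byte[]> sessionStore; // sessionId -> serialized session (e.g. Avro)

      @Override
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        // "session-store" must match a stores.<name> entry in the job config
        sessionStore = (KeyValueStore<String, byte[]>) context.getStore("session-store");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        // merge the incoming event into its session and put() it back (omitted)
      }

      @Override
      public void window(MessageCollector collector, TaskCoordinator coordinator) {
        // Collect closed session ids during the scan and delete after closing the
        // iterator; each delete leaves a RocksDB tombstone until compaction.
        List<String> closedSessionIds = new ArrayList<String>();
        KeyValueIterator<String, byte[]> it = sessionStore.all();
        try {
          while (it.hasNext()) {
            Entry<String, byte[]> entry = it.next();
            // emit a snapshot of the open session here (omitted)
            if (isClosed(entry.getValue())) {
              closedSessionIds.add(entry.getKey());
            }
          }
        } finally {
          it.close(); // RocksDB iterators hold resources; always close them
        }
        for (String sessionId : closedSessionIds) {
          sessionStore.delete(sessionId);
        }
      }

      private boolean isClosed(byte[] serializedSession) {
        return false; // placeholder: deserialize and check the session's end time
      }
    }

The corresponding stores.session-store.* config (RocksDB factory, key/value serdes, changelog topic) is assumed but not shown.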
On Tue, Aug 23, 2016 at 9:18 AM, David Yu <david...@optimizely.com> wrote:

Dear Samza guys,

We are here for some debugging suggestions on our Samza job (0.10.0), which lags behind on consumption after running for a couple of hours, regardless of the number of containers allocated (currently 5).

Briefly, the job aggregates events into sessions (in Avro) during process() and emits snapshots of the open sessions using window() every minute. This graph
<https://www.dropbox.com/s/utywr1j5eku0ec0/Screenshot%202016-08-23%2010.33.16.png?dl=0>
shows where processing started to lag (red is the number of events received and green is the number of events processed). The end result is a steady increase of the consumer lag
<https://www.dropbox.com/s/fppsv91c339xmdb/Screenshot%202016-08-23%2010.19.27.png?dl=0>.
What we are trying to track down is where the performance bottleneck is. But it's unclear at the moment whether that's in Samza or in Kafka.
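The once-a-minute window() cadence described above would normally come from the task.window.ms property in the job config; an illustrative line, with the value inferred from the "every minute" description rather than copied from the job's actual config:

    task.window.ms=60000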
What we know so far:

- Kafka producer seems to take a while writing to the downstream topics (changelog and session snapshots), as shown by various timers. Not sure which numbers are critical, but here are the producer metrics
<https://www.dropbox.com/s/pzi9304gw5vmae2/Screenshot%202016-08-23%2010.57.33.png?dl=0>
from one container.
- Avg windowing duration peaks at one point during the day (due to the number of open sessions), but everything is still sub-second
<https://www.dropbox.com/s/y2ps6pbs1tf257e/Screenshot%202016-08-23%2010.44.19.png?dl=0>.
- Our Kafka cluster doesn't seem to be overloaded
<https://www.dropbox.com/s/q01b4p4rg43spua/Screenshot%202016-08-23%2010.48.25.png?dl=0>,
with writes < 60 MB/s across all three brokers.

From all we know, we suspect that the bottleneck happens at producing to Kafka. But we need some help confirming that.

Any suggestion is appreciated.

David
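On the producer suspicion above and the acks question earlier in the thread: Kafka producer settings in a Samza job config are passed through the systems.<system-name>.producer.* namespace, so an explicit override would look roughly like the line below; "kafka" is a placeholder system name, and David notes the job keeps the default (acks=1) rather than setting this:

    systems.kafka.producer.acks=1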