[ https://issues.apache.org/jira/browse/SAMZA-963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Fred Ji reassigned SAMZA-963:
-----------------------------
Assignee: Fred Ji
> Add timers to help identify performance issues with KV stores and producers.
> ----------------------------------------------------------------------------
>
> Key: SAMZA-963
> URL: https://issues.apache.org/jira/browse/SAMZA-963
> Project: Samza
> Issue Type: Improvement
> Reporter: Jake Maes
> Assignee: Fred Ji
>
> We have good timing metrics for many of the primary actions in the event loop:
> * Choose
> ** Deserialization
> ** Poll
> * Process
> * Window
> * Commit
> I've noticed a few things while analyzing job performance at LinkedIn:
> 1. We can usually identify problems in Choose using the sub-metrics for
> Deserialization and Poll. I don't think any work needs to be done here.
> 2. Slowness in Process or Window is usually caused by business logic (e.g.
> side calls to remote DBs), but it can also be caused by slowness in the KV
> Store (e.g. "stalls" in the case of RocksDB).
> 3. Slowness in Commit can be caused by slow flushes of the stores or
> producers. It can also come from checkpointing.
> #2 would be easier to diagnose if we had timers around all the main KV Store
> operations, including get, put, delete, and the batch operations. Then we
> could isolate KV Store performance from business logic performance.
> #3 would be easier to diagnose if we had timers around all the flushes.
> Specifically, I think we should add a "flush-ns" metric to
> KeyValueStoreMetrics and update it from each of the stores. I noticed that
> KafkaSystemProducerMetrics has a "flush-ns" metric, so the Kafka producer is
> already covered.
> To summarize, this ticket is to add timing metrics around all KV Store
> operations: not just user operations like get/put, but flush as well.
> Related work: SAMZA-449
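
As a rough illustration of the timers the description asks for, the sketch below wraps a store in a decorator that records elapsed nanoseconds per operation. It is not Samza's code: SimpleStore, InMemoryStore, NanoTimer, and TimedStore are hypothetical names invented for this example, and the metric names other than "flush-ns" ("get-ns", "put-ns", "delete-ns") are assumptions modeled on it. Batch operations are left out to keep the sketch short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical minimal store interface; NOT Samza's KeyValueStore.
interface SimpleStore<K, V> {
    V get(K key);
    void put(K key, V value);
    void delete(K key);
    void flush();
}

// Trivial in-memory store, used only to exercise the wrapper.
class InMemoryStore<K, V> implements SimpleStore<K, V> {
    private final Map<K, V> data = new HashMap<>();
    public V get(K key) { return data.get(key); }
    public void put(K key, V value) { data.put(key, value); }
    public void delete(K key) { data.remove(key); }
    public void flush() { /* nothing to flush in memory */ }
}

// Accumulates total elapsed nanoseconds and a call count for one metric.
class NanoTimer {
    final AtomicLong totalNanos = new AtomicLong();
    final AtomicLong calls = new AtomicLong();

    void update(long nanos) {
        totalNanos.addAndGet(nanos);
        calls.incrementAndGet();
    }
}

// Decorator that times every call to the wrapped store, so store latency
// can be read separately from the business logic that calls it.
class TimedStore<K, V> implements SimpleStore<K, V> {
    private final SimpleStore<K, V> inner;
    final Map<String, NanoTimer> timers = new HashMap<>();

    TimedStore(SimpleStore<K, V> inner) {
        this.inner = inner;
        // Assumed metric names; only "flush-ns" appears in the ticket.
        for (String name : new String[] {"get-ns", "put-ns", "delete-ns", "flush-ns"}) {
            timers.put(name, new NanoTimer());
        }
    }

    // Runs op and records its duration even if it throws.
    private <T> T time(String name, Supplier<T> op) {
        long start = System.nanoTime();
        try {
            return op.get();
        } finally {
            timers.get(name).update(System.nanoTime() - start);
        }
    }

    public V get(K key) { return time("get-ns", () -> inner.get(key)); }

    public void put(K key, V value) {
        time("put-ns", () -> { inner.put(key, value); return null; });
    }

    public void delete(K key) {
        time("delete-ns", () -> { inner.delete(key); return null; });
    }

    public void flush() {
        time("flush-ns", () -> { inner.flush(); return null; });
    }
}

class TimedStoreDemo {
    public static void main(String[] args) {
        TimedStore<String, String> store = new TimedStore<>(new InMemoryStore<>());
        store.put("k", "v");
        store.get("k");
        store.flush();
        // Print call counts and accumulated time for each metric.
        store.timers.forEach((name, t) ->
            System.out.println(name + " calls=" + t.calls.get() + " totalNanos=" + t.totalNanos.get()));
    }
}
```

Wrapping the store keeps the timing code out of each store implementation. The ticket's suggestion of updating a shared "flush-ns" metric from inside each store is a different placement of the same measurement.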