Hi, Alex. IMO:

1. Distributing metrics through discovery requires refactoring.
2. Local cache metrics should be available (if configured) on each node.
3. It should be possible to configure metrics at runtime.
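For 2 and 3, a minimal sketch of what this could look like with the existing public API ("myCache" is just an example cache name, and the runtime toggle assumes IgniteCluster.enableStatistics is available in your Ignite version):

    import java.util.Collections;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CacheMetrics;

    public class LocalCacheMetricsSketch {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                IgniteCache<Integer, String> cache = ignite.getOrCreateCache("myCache");

                // 3. Turn statistics on for selected caches at runtime instead of
                //    restarting nodes with a changed CacheConfiguration.
                ignite.cluster().enableStatistics(Collections.singleton("myCache"), true);

                cache.put(1, "one");
                cache.get(1);

                // 2. Read metrics collected on the local node only, without waiting
                //    for a cluster-wide metrics update message.
                CacheMetrics local = cache.localMetrics();
                System.out.println("Local gets: " + local.getCacheGets());
            }
        }
    }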
thanks.

> Hi Igniters,
>
> In the current implementation, cache metrics are collected on each node and
> sent across the whole cluster with a discovery message
> (TcpDiscoveryMetricsUpdateMessage) at a configured frequency
> (MetricsUpdateFrequency, 2 seconds by default), even if no one has requested
> them. If there are a lot of caches and a lot of nodes in the cluster, the
> metrics update message (which contains every metric for every cache on every
> node) can reach a critical size.
>
> Frequently collecting all cache metrics also has a negative performance
> impact (some of them just read a value from an AtomicLong, but some require
> an iteration over all cache partitions). The only way now to disable
> collecting cache metrics and sending them with the discovery message is to
> disable statistics for each cache, but that also makes it impossible to
> request some cache metrics locally (for the current node only). Requesting a
> limited set of cache metrics on the current node doesn't have the same
> performance impact as frequently collecting all of them, and it is sometimes
> enough for diagnostic purposes.
>
> As a workaround I have filed and implemented ticket [1], which introduces a
> new system property to disable sending cache metrics with
> TcpDiscoveryMetricsUpdateMessage (when this property is set, the message
> contains only node metrics). But a system property is not a good permanent
> solution; perhaps it's better to move such a setting to the public API
> (IgniteConfiguration, for example).
>
> Also, maybe we should change the cache metrics distribution strategy? For
> example, collect metrics on request via communication SPI, or subscribe to a
> limited set of caches/metrics, etc.
>
> Thoughts?
>
> [1]: https://issues.apache.org/jira/browse/IGNITE-10172

--
Zhenya Stanilovsky
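For completeness, the knobs mentioned in the quoted message look roughly like this today (a minimal sketch; the values and cache name are illustrative):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class MetricsConfigSketch {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // How often metrics are refreshed and sent with
            // TcpDiscoveryMetricsUpdateMessage; 2000 ms is the default.
            cfg.setMetricsUpdateFrequency(2000);

            // Currently the only per-cache switch: disabling statistics stops cache
            // metrics collection entirely, so local requests stop being useful too.
            CacheConfiguration<Integer, String> cacheCfg = new CacheConfiguration<>("myCache");
            cacheCfg.setStatisticsEnabled(false);
            cfg.setCacheConfiguration(cacheCfg);

            try (Ignite ignite = Ignition.start(cfg)) {
                // Node is up; cache metrics for "myCache" are not collected.
            }
        }
    }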