Hi Igniters,

In the current implementation, cache metrics are collected on each node and sent across the whole cluster with a discovery message (TcpDiscoveryMetricsUpdateMessage) at a configured frequency (MetricsUpdateFrequency, 2 seconds by default), even if no one has requested them. If there are many caches and many nodes in the cluster, the metrics update message (which contains every metric for every cache on every node) can reach a critical size.
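For context, the frequency is controlled by IgniteConfiguration.setMetricsUpdateFrequency. A minimal Spring XML sketch (the 2000 ms value is just the default made explicit):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- How often each node sends its metrics to the cluster with
         TcpDiscoveryMetricsUpdateMessage; 2000 ms is the default. -->
    <property name="metricsUpdateFrequency" value="2000"/>
</bean>
```

Note that raising this value only reduces the frequency of the traffic; it doesn't address the message size or the collection cost described above.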
Frequent collection of all cache metrics also has a negative performance impact: some metrics just read a value from an AtomicLong, but others require iterating over all cache partitions. Currently, the only way to disable cache metrics collection and sending with the discovery message is to disable statistics for each cache. But this also makes it impossible to request some cache metrics locally (for the current node only). Requesting a limited set of cache metrics on the current node doesn't have the same performance impact as frequently collecting all cache metrics, and sometimes it's enough for diagnostic purposes.

As a workaround, I have filed and implemented ticket [1], which introduces a new system property to disable sending cache metrics with TcpDiscoveryMetricsUpdateMessage (if the property is set, the message will contain only node metrics). But a system property is not good as a permanent solution. Perhaps it's better to move this setting to the public API (to IgniteConfiguration, for example).

Maybe we should also change the cache metrics distribution strategy? For example, collect metrics on request via communication SPI, or subscribe to a limited set of caches/metrics, etc.

Thoughts?

[1]: https://issues.apache.org/jira/browse/IGNITE-10172
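For reference, the current per-cache workaround looks roughly like this in Spring XML (a sketch; the cache name is just an example, and note the side effect mentioned above):

```xml
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <property name="name" value="exampleCache"/>
    <!-- Disables statistics for this cache. This stops its metrics from
         being collected and sent with discovery messages, but it also
         makes the metrics unavailable locally on the current node. -->
    <property name="statisticsEnabled" value="false"/>
</bean>
```

This has to be repeated for every cache, which is exactly why a single cluster-wide switch (currently the system property from [1], ideally an IgniteConfiguration flag) seems preferable.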