Hi Igniters,

In the current implementation, cache metrics are collected on each node and sent across the whole cluster with a discovery message (TcpDiscoveryMetricsUpdateMessage) at a configured frequency (MetricsUpdateFrequency, 2 seconds by default), even if no one has requested them. If there are many caches and many nodes in the cluster, the metrics update message (which contains every metric for every cache on every node) can reach a critical size.
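For context, the frequency is controlled by IgniteConfiguration.setMetricsUpdateFrequency. A minimal Spring XML sketch (the 2000 ms value is just the default made explicit):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- How often each node sends its metrics to the cluster with
         TcpDiscoveryMetricsUpdateMessage; 2000 ms is the default. -->
    <property name="metricsUpdateFrequency" value="2000"/>
</bean>
```

Note that raising this value only reduces the frequency of the traffic; it doesn't address the message size or the collection cost described above.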
Frequent collection of all cache metrics also has a negative performance impact: some metrics just read a value from an AtomicLong, but others require iterating over all cache partitions. Currently, the only way to disable cache metrics collection and sending with the discovery message is to disable statistics for each cache. But this also makes it impossible to request some cache metrics locally (for the current node only). Requesting a limited set of cache metrics on the current node doesn't have the same performance impact as frequently collecting all cache metrics, and sometimes it's enough for diagnostic purposes.

As a workaround, I have filed and implemented ticket [1], which introduces a new system property to disable sending cache metrics with TcpDiscoveryMetricsUpdateMessage (if the property is set, the message will contain only node metrics). But a system property is not good as a permanent solution. Perhaps it's better to move this setting to the public API (to IgniteConfiguration, for example).

Maybe we should also change the cache metrics distribution strategy? For example, collect metrics on request via communication SPI, or subscribe to a limited set of caches/metrics, etc.

Thoughts?

[1]: https://issues.apache.org/jira/browse/IGNITE-10172
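For reference, the current per-cache workaround looks roughly like this in Spring XML (a sketch; the cache name is just an example, and note the side effect mentioned above):

```xml
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <property name="name" value="exampleCache"/>
    <!-- Disables statistics for this cache. This stops its metrics from
         being collected and sent with discovery messages, but it also
         makes the metrics unavailable locally on the current node. -->
    <property name="statisticsEnabled" value="false"/>
</bean>
```

This has to be repeated for every cache, which is exactly why a single cluster-wide switch (currently the system property from [1], ideally an IgniteConfiguration flag) seems preferable.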