Hi,

One of the problems with metrics is their huge size when many caches are started on a node (for example, I've seen 7000 caches). We have to think about how to compact them. Not all metrics change frequently, so we may store them locally and send over the wire only the difference from the previous collection.
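To make the idea concrete, here is a minimal sketch of such delta tracking (class and metric names are illustrative, not Ignite's actual API): the node keeps the last snapshot of metric values it sent and emits only the entries that changed since the previous collection, so unchanged metrics never go over the wire.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: keeps the last snapshot of metric values sent from
 * this node and produces only the entries that changed since the previous
 * collection. Unchanged metrics are omitted from the wire payload entirely.
 */
public class MetricsDeltaTracker {
    /** Last values sent, keyed by metric name. */
    private final Map<String, Long> lastSent = new HashMap<>();

    /** Returns only the metrics whose value changed since the previous call. */
    public Map<String, Long> delta(Map<String, Long> current) {
        Map<String, Long> diff = new HashMap<>();

        for (Map.Entry<String, Long> e : current.entrySet()) {
            Long prev = lastSent.get(e.getKey());

            // Include the metric only if it is new or its value changed.
            if (!e.getValue().equals(prev))
                diff.put(e.getKey(), e.getValue());
        }

        lastSent.putAll(current);

        return diff;
    }
}
```

With 7000 caches where most counters are idle between collections, the diff map would typically be a small fraction of the full snapshot.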
And we should think carefully about the storage format. For example, if the current cache metrics are passed as a JSON object, then 70% of it will be strings with metric names.

On Tue, Dec 4, 2018 at 7:22 PM Vladimir Ozerov <voze...@gridgain.com> wrote:

> Hi Alex,
>
> Agree with you. Most of the time this distribution of metrics is not
> needed. In the future we will have more and more information which
> potentially needs to be shared between nodes. E.g. IO statistics, SQL
> statistics for the query optimizer, SQL execution history, etc. We need
> common mechanics for this, so I vote for your proposal:
> 1) Data is collected locally
> 2) If a node needs to collect data from the cluster, it sends an explicit
> request over communication SPI
> 3) For performance reasons we may consider caching - return previously
> collected metrics without re-requesting them again if they are not too old
> (configurable)
>
> On Tue, Dec 4, 2018 at 12:46 PM Alex Plehanov <plehanov.a...@gmail.com>
> wrote:
>
> > Hi Igniters,
> >
> > In the current implementation, cache metrics are collected on each node
> > and sent across the whole cluster with a discovery message
> > (TcpDiscoveryMetricsUpdateMessage) at a configured frequency
> > (MetricsUpdateFrequency, 2 seconds by default), even if no one requested
> > them.
> > If there are a lot of caches and a lot of nodes in the cluster, the
> > metrics update message (which contains each metric for each cache on
> > each node) can reach a critical size.
> >
> > Also, frequently collecting all cache metrics has a negative performance
> > impact (some of them just read values from an AtomicLong, but some of
> > them need an iteration over all cache partitions).
> > The only way now to disable cache metrics collection and sending with
> > the discovery message is to disable statistics for each cache. But this
> > also makes it impossible to request some cache metrics locally (for the
> > current node only).
> > Requesting a limited set of cache metrics on the current node doesn't
> > have such a performance impact as the frequent collection of all cache
> > metrics, but sometimes it's enough for diagnostic purposes.
> >
> > As a workaround I have filed and implemented ticket [1], which
> > introduces a new system property to disable cache metrics sending with
> > TcpDiscoveryMetricsUpdateMessage (if this property is set, the message
> > will contain only node metrics). But a system property is not good for
> > a permanent solution. Perhaps it's better to move such a property to
> > the public API (to IgniteConfiguration, for example).
> >
> > Also, maybe we should change the cache metrics distribution strategy?
> > For example, collect metrics by request via communication SPI, or
> > subscribe to a limited set of caches/metrics, etc.
> >
> > Thoughts?
> >
> > [1]: https://issues.apache.org/jira/browse/IGNITE-10172

--
Alexey Kuznetsov
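P.S. A minimal sketch of the request-and-cache approach discussed above (all names are hypothetical, not Ignite's actual API): a snapshot collected from a remote node is reused as long as it is younger than a configurable maximum age, and only a stale snapshot triggers a new request (which in the real implementation would be a communication-SPI round trip).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical sketch of on-demand metrics collection with a freshness
 * bound: cached snapshots are returned as-is while younger than maxAgeMs,
 * so repeated requests for the same node don't hit the network.
 */
public class CachedMetricsCollector {
    /** A collected snapshot together with its collection time. */
    private static class Snapshot {
        final Map<String, Long> metrics;
        final long collectedAt;

        Snapshot(Map<String, Long> metrics, long collectedAt) {
            this.metrics = metrics;
            this.collectedAt = collectedAt;
        }
    }

    /** Snapshots keyed by node id. */
    private final Map<String, Snapshot> cache = new ConcurrentHashMap<>();

    /** Max snapshot age in milliseconds before a re-request is needed. */
    private final long maxAgeMs;

    public CachedMetricsCollector(long maxAgeMs) {
        this.maxAgeMs = maxAgeMs;
    }

    /**
     * Returns the cached snapshot for the node if it is fresh enough;
     * otherwise fetches a new one via the supplied request function
     * (standing in for a real communication-SPI request).
     */
    public Map<String, Long> metrics(String nodeId,
        Function<String, Map<String, Long>> request, long now) {
        Snapshot s = cache.get(nodeId);

        if (s == null || now - s.collectedAt > maxAgeMs) {
            s = new Snapshot(request.apply(nodeId), now);
            cache.put(nodeId, s);
        }

        return s.metrics;
    }
}
```

The freshness bound here plays the role of the "not too old (configurable)" condition from point 3 of the proposal.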