Hi,

One of the problems with metrics is their huge size when many caches are started on a node (for example, I've seen 7000 caches). We have to think about how to compact them. Not all metrics change frequently, so we may store them locally and send over the wire only the difference from the previous collection.
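To make the idea concrete, here is a minimal sketch of such delta tracking (class and metric names are illustrative, not Ignite's actual API): the node keeps the last snapshot of metric values it sent and emits only the entries that changed since the previous collection, so unchanged metrics never go over the wire.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: keeps the last snapshot of metric values sent from
 * this node and produces only the entries that changed since the previous
 * collection. Unchanged metrics are omitted from the wire payload entirely.
 */
public class MetricsDeltaTracker {
    /** Last values sent, keyed by metric name. */
    private final Map<String, Long> lastSent = new HashMap<>();

    /** Returns only the metrics whose value changed since the previous call. */
    public Map<String, Long> delta(Map<String, Long> current) {
        Map<String, Long> diff = new HashMap<>();

        for (Map.Entry<String, Long> e : current.entrySet()) {
            Long prev = lastSent.get(e.getKey());

            // Include the metric only if it is new or its value changed.
            if (!e.getValue().equals(prev))
                diff.put(e.getKey(), e.getValue());
        }

        lastSent.putAll(current);

        return diff;
    }
}
```

With 7000 caches where most counters are idle between collections, the diff map would typically be a small fraction of the full snapshot.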
And we should think carefully about the storage format. For example, if the current cache metrics are passed as a JSON object, then 70% of it will be strings with metric names.

On Tue, Dec 4, 2018 at 7:22 PM Vladimir Ozerov <voze...@gridgain.com> wrote:

> Hi Alex,
>
> Agree with you. Most of the time this distribution of metrics is not
> needed. In the future we will have more and more information which
> potentially needs to be shared between nodes. E.g. IO statistics, SQL
> statistics for the query optimizer, SQL execution history, etc. We need
> common mechanics for this, so I vote for your proposal:
> 1) Data is collected locally
> 2) If a node needs to collect data from the cluster, it sends an explicit
> request over communication SPI
> 3) For performance reasons we may consider caching - return previously
> collected metrics without re-requesting them again if they are not too old
> (configurable)
>
> On Tue, Dec 4, 2018 at 12:46 PM Alex Plehanov <plehanov.a...@gmail.com>
> wrote:
>
> > Hi Igniters,
> >
> > In the current implementation, cache metrics are collected on each node
> > and sent across the whole cluster with a discovery message
> > (TcpDiscoveryMetricsUpdateMessage) at a configured frequency
> > (MetricsUpdateFrequency, 2 seconds by default), even if no one requested
> > them.
> > If there are a lot of caches and a lot of nodes in the cluster, the
> > metrics update message (which contains each metric for each cache on
> > each node) can reach a critical size.
> >
> > Also, frequently collecting all cache metrics has a negative performance
> > impact (some of them just read values from an AtomicLong, but some of
> > them need an iteration over all cache partitions).
> > The only way now to disable cache metrics collection and sending with
> > the discovery message is to disable statistics for each cache. But this
> > also makes it impossible to request some cache metrics locally (for the
> > current node only).
> > Requesting a limited set of cache metrics on the current node doesn't
> > have such a performance impact as the frequent collection of all cache
> > metrics, but sometimes it's enough for diagnostic purposes.
> >
> > As a workaround I have filed and implemented ticket [1], which
> > introduces a new system property to disable cache metrics sending with
> > TcpDiscoveryMetricsUpdateMessage (if this property is set, the message
> > will contain only node metrics). But a system property is not good for
> > a permanent solution. Perhaps it's better to move such a property to
> > the public API (to IgniteConfiguration, for example).
> >
> > Also, maybe we should change the cache metrics distribution strategy?
> > For example, collect metrics by request via communication SPI, or
> > subscribe to a limited set of caches/metrics, etc.
> >
> > Thoughts?
> >
> > [1]: https://issues.apache.org/jira/browse/IGNITE-10172

--
Alexey Kuznetsov
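P.S. A minimal sketch of the request-and-cache approach discussed above (all names are hypothetical, not Ignite's actual API): a snapshot collected from a remote node is reused as long as it is younger than a configurable maximum age, and only a stale snapshot triggers a new request (which in the real implementation would be a communication-SPI round trip).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical sketch of on-demand metrics collection with a freshness
 * bound: cached snapshots are returned as-is while younger than maxAgeMs,
 * so repeated requests for the same node don't hit the network.
 */
public class CachedMetricsCollector {
    /** A collected snapshot together with its collection time. */
    private static class Snapshot {
        final Map<String, Long> metrics;
        final long collectedAt;

        Snapshot(Map<String, Long> metrics, long collectedAt) {
            this.metrics = metrics;
            this.collectedAt = collectedAt;
        }
    }

    /** Snapshots keyed by node id. */
    private final Map<String, Snapshot> cache = new ConcurrentHashMap<>();

    /** Max snapshot age in milliseconds before a re-request is needed. */
    private final long maxAgeMs;

    public CachedMetricsCollector(long maxAgeMs) {
        this.maxAgeMs = maxAgeMs;
    }

    /**
     * Returns the cached snapshot for the node if it is fresh enough;
     * otherwise fetches a new one via the supplied request function
     * (standing in for a real communication-SPI request).
     */
    public Map<String, Long> metrics(String nodeId,
        Function<String, Map<String, Long>> request, long now) {
        Snapshot s = cache.get(nodeId);

        if (s == null || now - s.collectedAt > maxAgeMs) {
            s = new Snapshot(request.apply(nodeId), now);
            cache.put(nodeId, s);
        }

        return s.metrics;
    }
}
```

The freshness bound here plays the role of the "not too old (configurable)" condition from point 3 of the proposal.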