Re: [DISCUSSION] Performance issue with cluster-wide cache metrics distribution

Alex Plehanov Tue, 11 Dec 2018 06:17:41 -0800

Denis, I measure the impact of metrics collecting on my laptop, it's about
5 seconds on each node for collecting metrics of 1000 caches (all caches in
one cache group) with 32000 partitions. All this time tcp-disco-msg-worker
is blocked.


Guys, thanks for your proposals, I'd filled ticket [1].

[1]: https://issues.apache.org/jira/browse/IGNITE-10642


вт, 4 дек. 2018 г. в 18:07, Alexey Kuznetsov <[email protected]>:

> Hi,
>
> One of the problems with metrics is a huge size in case when a lot caches
> started on node (for example, I see 7000 caches).
> We have to think how to compact them.
> Not all metrics changed frequently, so, we may store locally and send over
> wire only a difference from previous collect.
>
> And think carefully about store format. For example, if current cache
> metrics will be passed as JSON object,
>  then 70% of it will be strings with metrics names.
>
>
> On Tue, Dec 4, 2018 at 7:22 PM Vladimir Ozerov <[email protected]>
> wrote:
>
> > Hi Alex,
> >
> > Agree with you. Most of the time these distribution of metrics is not
> > needed. In future we will have more and more information which
> potentially
> > needs to be shared between nodes. E.g. IO statistics, SQL statistics for
> > query optimizer, SQL execution history, etc. We need common mechanics for
> > this, so I vote for your proposal:
> > 1) Data is collected locally
> > 2) If a node needs to collect data from the cluster, it sends explicit
> > request over communication SPI
> > 3) For performance reasons we may consider caching - return previously
> > collected metrics without re-requesting them again if they are not too
> old
> > (configurable)
> >
> > On Tue, Dec 4, 2018 at 12:46 PM Alex Plehanov <[email protected]>
> > wrote:
> >
> > > Hi Igniters,
> > >
> > > In the current implementation, cache metrics are collected on each node
> > and
> > > sent across the whole cluster with discovery message
> > > (TcpDiscoveryMetricsUpdateMessage) with configured frequency
> > > (MetricsUpdateFrequency, 2 seconds by default) even if no one requested
> > > them.
> > > If there are a lot of caches and a lot of nodes in the cluster, metrics
> > > update message (which contain each metric for each cache on each node)
> > can
> > > reach a critical size.
> > >
> > > Also frequently collecting all cache metrics have a negative
> performance
> > > impact (some of them just get values from AtomicLong, but some of them
> > need
> > > an iteration over all cache partitions).
> > > The only way now to disable cache metrics collecting and sending with
> > > discovery message is to disable statistics for each cache. But this
> also
> > > makes impossible to request some of cache metrics locally (for the
> > current
> > > node only). Requesting a limited set of cache metrics on the current
> node
> > > doesn't have such performance impact as the frequent collecting of all
> > > cache metrics, but sometimes it's enough for diagnostic purposes.
> > >
> > > As a workaround I have filled and implemented ticket [1], which
> > introduces
> > > new system property to disable cache metrics sending with
> > > TcpDiscoveryMetricsUpdateMessage (in case this property is set, the
> > message
> > > will contain only node metrics). But system property is not good for a
> > > permanent solution. Perhaps it's better to move such property to public
> > API
> > > (to IgniteConfiguration for example).
> > >
> > > Also maybe we should change cache metrics distributing strategy? For
> > > example, collect metrics by request via communication SPI or subscribe
> > to a
> > > limited set of cache/metrics, etc.
> > >
> > > Thoughts?
> > >
> > > [1]: https://issues.apache.org/jira/browse/IGNITE-10172
> > >
> >
>
>
> --
> Alexey Kuznetsov
>

Re: [DISCUSSION] Performance issue with cluster-wide cache metrics distribution

Reply via email to