[
https://issues.apache.org/jira/browse/IGNITE-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304502#comment-16304502
]
Aleksey Plekhanov edited comment on IGNITE-6923 at 12/27/17 12:33 PM:
----------------------------------------------------------------------
Four fixes were made:
# Each cluster metrics calculation invoked the
{DiscoveryMetricsProvider#cacheMetrics} method, which creates metrics snapshots
for all caches and is heavy when a large number of caches are started. Only one
cache metric ({getOffHeapAllocatedSize}) is actually needed for the cluster
metrics calculation, so the {cacheMetrics} invocation was replaced with a
direct call to {getOffHeapAllocatedSize}.
# Each invocation of {DiscoveryMetricsProvider#metrics} created a cluster
metrics snapshot. That is fine for the discovery update message (all metrics
are serialized, so all of them must be calculated), but sometimes it is
redundant. For example, a call to {locNode.metrics().getTotalCpus()} calculated
every cluster metric but used only the total CPU count, which can be computed
at a much lower cost. To solve this, the {DiscoveryMetricsProvider#metrics}
method no longer calculates a snapshot of all metrics; it returns a
{ClusterMetricsImpl} instance, which calculates each metric on demand (the
first sketch after this list illustrates the pattern).
# Some cache metrics (entries count, partitions count) iterate through the
local partitions to get their result. There are 9 such methods, and each of
them performed this iteration on every call, although all 9 metrics can be
computed in a single iteration over the partitions. A new method was
implemented to calculate all these metrics at once, and in
{CacheMetricsSnapshot} the individual calculations were replaced by calls to it
(see the second sketch after this list).
# If there are many nodes and many caches in the cluster, the size of
{TcpDiscoveryMetricsUpdateMessage} can be rather large (up to hundreds of
megabytes), because it carries every metric of every cache on every node.
Sending cache metrics in the discovery message could already be turned off by
disabling statistics for the caches, but that also made the cache metrics
unavailable via JMX. A new system property was added to disable the cache
metrics update in the discovery message without disabling statistics (see the
third sketch after this list).
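A minimal, self-contained sketch of the idea behind the first two fixes: an
eager snapshot pays for every metric up front, while an on-demand
implementation pays only for what the caller reads. The {Metrics} interface and
both classes below are simplified stand-ins for illustration, not the actual
Ignite types:
{code:java}
import java.lang.management.ManagementFactory;

public class OnDemandMetricsDemo {
    /** Simplified stand-in for the cluster metrics contract. */
    interface Metrics {
        int getTotalCpus();
        long getNonHeapMemoryUsed();
    }

    /** Snapshot-style: every value is computed up front, even if never read. */
    static class EagerSnapshot implements Metrics {
        private final int cpus = Runtime.getRuntime().availableProcessors();
        private final long nonHeapUsed =
            ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage().getUsed();

        @Override public int getTotalCpus() { return cpus; }
        @Override public long getNonHeapMemoryUsed() { return nonHeapUsed; }
    }

    /** On-demand: each getter computes only what the caller actually asks for. */
    static class OnDemandMetrics implements Metrics {
        @Override public int getTotalCpus() {
            return Runtime.getRuntime().availableProcessors();
        }

        @Override public long getNonHeapMemoryUsed() {
            return ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage().getUsed();
        }
    }

    public static void main(String[] args) {
        // A caller that needs only the CPU count pays only for the CPU count.
        Metrics m = new OnDemandMetrics();

        System.out.println("CPUs: " + m.getTotalCpus());
    }
}
{code}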
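A hedged sketch of the third fix: folding several per-partition counters into a
single pass. The {Partition} and {EntriesCountMetrics} types below are
illustrative only, not the Ignite API:
{code:java}
import java.util.Arrays;
import java.util.List;

public class OnePassPartitionMetrics {
    /** Illustrative local-partition model. */
    static class Partition {
        final boolean primary;
        final long entriesCnt;

        Partition(boolean primary, long entriesCnt) {
            this.primary = primary;
            this.entriesCnt = entriesCnt;
        }
    }

    /** All partition-derived counters, filled by a single pass. */
    static class EntriesCountMetrics {
        long primaryEntriesCnt;
        long backupEntriesCnt;
        int primaryPartsCnt;
        int backupPartsCnt;
    }

    /** One iteration over local partitions instead of one iteration per metric. */
    static EntriesCountMetrics calculate(List<Partition> locParts) {
        EntriesCountMetrics m = new EntriesCountMetrics();

        for (Partition p : locParts) {
            if (p.primary) {
                m.primaryPartsCnt++;
                m.primaryEntriesCnt += p.entriesCnt;
            }
            else {
                m.backupPartsCnt++;
                m.backupEntriesCnt += p.entriesCnt;
            }
        }

        return m;
    }

    public static void main(String[] args) {
        List<Partition> parts = Arrays.asList(
            new Partition(true, 100), new Partition(false, 40), new Partition(true, 60));

        EntriesCountMetrics m = calculate(parts);

        System.out.println(m.primaryPartsCnt + " primary partitions, "
            + m.primaryEntriesCnt + " primary entries");
    }
}
{code}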
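The fourth fix can be pictured as a simple gate in the metrics provider. The
property name in this sketch is a placeholder (the comment above does not spell
out the property that was actually added, so check the commit for the real
name):
{code:java}
import java.util.Collections;
import java.util.Map;

public class CacheMetricsGateDemo {
    /** Placeholder name; the actual property added by the fix may differ. */
    static final String DISABLE_CACHE_METRICS_PROP = "EXAMPLE_DISABLE_CACHE_METRICS_UPDATE";

    /** Returns per-cache metrics, or an empty map when the update is disabled. */
    static Map<Integer, String> cacheMetrics(Map<Integer, String> collected) {
        // With an empty map, no cache metrics get into the discovery message,
        // while cache statistics (and therefore the JMX metrics) stay enabled.
        if (Boolean.getBoolean(DISABLE_CACHE_METRICS_PROP))
            return Collections.emptyMap();

        return collected;
    }

    public static void main(String[] args) {
        System.setProperty(DISABLE_CACHE_METRICS_PROP, "true");

        Map<Integer, String> metrics = cacheMetrics(Collections.singletonMap(1, "cache-a"));

        System.out.println("Metrics sent: " + metrics); // Prints an empty map.
    }
}
{code}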
Some benchmarks:
Environment: 2 nodes, 200 caches (with statistics enabled), 1024 partitions per
cache, 10000 job metrics snapshots.
* First optimization (direct call to {getOffHeapAllocatedSize})
Subject: {DiscoveryMetricsProvider.metrics()} method
Before optimization: 17 operations per second
After optimization: 8000 operations per second
* Second optimization ({ClusterMetricsImpl} instead of {ClusterMetricsSnapshot})
Subject: {DiscoveryMetricsProvider.metrics().getTotalCpus()}
Before optimization: 8000 operations per second
After optimization: 2000000 operations per second
However, an individual call to {getTotalCpus()} is relatively rare; in most
cases {DiscoveryMetricsProvider.metrics()} is used for sending
{TcpDiscoveryMetricsUpdateMessage}, and the performance of
{ClusterMetricsSnapshot.serialize(DiscoveryMetricsProvider.metrics())} stayed
the same (8000 operations per second). Perhaps after the first optimization
this (second) optimization is no longer needed?
* Third optimization (one iteration over partitions)
Subject: {DiscoveryMetricsProvider.cacheMetrics()}
Before optimization: 17 operations per second
After optimization: 75 operations per second
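For reference, the "operations per second" figures above can be reproduced in
spirit with a hand-rolled probe like the one below; the measured call is just a
stand-in, and a real comparison should use a proper harness such as JMH:
{code:java}
public class MetricsThroughputProbe {
    public static void main(String[] args) {
        final long durationNanos = 3_000_000_000L; // Measure for ~3 seconds.

        long deadline = System.nanoTime() + durationNanos;
        long ops = 0;
        long sink = 0; // Consume the result so the JIT cannot drop the call.

        while (System.nanoTime() < deadline) {
            // Stand-in for the measured call, e.g. provider.metrics().getTotalCpus().
            sink += Runtime.getRuntime().availableProcessors();
            ops++;
        }

        System.out.println(ops / (durationNanos / 1_000_000_000L)
            + " operations per second (sink=" + sink + ")");
    }
}
{code}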
> Cache metrics are updated in timeout-worker potentially delaying critical
> code execution due to current implementation issues.
> ------------------------------------------------------------------------------------------------------------------------------
>
> Key: IGNITE-6923
> URL: https://issues.apache.org/jira/browse/IGNITE-6923
> Project: Ignite
> Issue Type: Improvement
> Affects Versions: 2.3
> Reporter: Alexei Scherbakov
> Assignee: Aleksey Plekhanov
> Labels: iep-6
> Fix For: 2.4
>
>
> Some metrics use cache iteration for calculation. If the number of caches is
> rather large, this can be slow.
> Similar code runs in the discovery thread.
> See the stack trace below for an example.
> {noformat}
> "grid-timeout-worker-#39%DPL_GRID%DplGridNodeName%" #152 prio=5 os_prio=0
> tid=0x00007f1009a03000 nid=0x5caa runnable [0x00007f0f059d9000]
> java.lang.Thread.State: RUNNABLE
> at java.util.HashMap.containsKey(HashMap.java:595)
> at java.util.HashSet.contains(HashSet.java:203)
> at
> java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032)
> at
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$3.apply(IgniteCacheOffheapManagerImpl.java:339)
> at
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$3.apply(IgniteCacheOffheapManagerImpl.java:337)
> at
> org.apache.ignite.internal.util.lang.gridfunc.TransformFilteringIterator.hasNext:@TransformFilteringIterator.java:90)
> at
> org.apache.ignite.internal.util.lang.GridIteratorAdapter.hasNext(GridIteratorAdapter.java:45)
>
> at
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.cacheEntriesCount(IgniteCacheOffheapManagerImpl.java:293)
> at
> org.apache.ignite.internal.processors.cache.CacheMetricsImpl.getOffHeapPrimaryEntriesCount(CacheMetricsImpl.java:240)
> at
> org.apache.ignite.internal.processors.cache.CacheMetricsSnapshot.<init>(CacheMetricsSnapshot.java:271)
>
> at
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.localMetrics(GridCacheAdapter.java:3217)
>
> at
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$7.cacheMetrics(GridDiscoveryManager.java:1151)
> at
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$7.nonHeapMemoryUsed(GridDiscoveryManager.java:1121)
> at
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$7.metrics(GridDiscoveryManager.java:1087)
>
> at
> org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNode.metrics(TcpDiscoveryNode.java:269)
>
> at
> org.apache.ignite.internal.IgniteKernal$3.run(IgniteKernal.java:1175)
> at
> org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$CancelableTask.onTimeout(GridTimeoutProcessor.java:256)
> - locked <0x00007f115f5bf890> (a
> org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$CancelableTask)
> at
> org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:158)
> at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)