[ https://issues.apache.org/jira/browse/IGNITE-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304502#comment-16304502 ]
Aleksey Plekhanov commented on IGNITE-6923:
-------------------------------------------

There were 4 fixes made:
# Each cluster metrics calculation invoked the DiscoveryMetricsProvider#cacheMetrics method, which creates metrics snapshots for all caches and is expensive when a large number of caches are started. But only one cache metric (getOffHeapAllocatedSize) is needed for the cluster metrics calculation, so the invocation of cacheMetrics was replaced with a direct call to getOffHeapAllocatedSize.
# Each invocation of DiscoveryMetricsProvider#metrics created a cluster metrics snapshot. That is fine for the discovery update message (all metrics are serialized, so all of them have to be calculated), but in other cases it is redundant. For example, a call to locNode.metrics().getTotalCpus() calculated every cluster metric but used only the total CPU count, which can be calculated at a lower cost. To solve this problem, the DiscoveryMetricsProvider#metrics method no longer calculates a snapshot of all metrics; it returns a ClusterMetricsImpl instance, which calculates each metric on demand.
# Some cache metrics (entries count, partitions count) iterate through local partitions to get the result. There are 9 such methods, and each of them performed its own iteration on every call. But these 9 metrics can be calculated in one iteration over the partitions. A new method was implemented to calculate all of these metrics at once, and in CacheMetricsSnapshot the individual calculations were replaced by the new method.
# If there are a lot of nodes and a lot of caches in the cluster, the size of TcpDiscoveryMetricsUpdateMessage can be rather large (up to hundreds of megabytes), because it contains information about each metric for each cache on each node. Sending cache metrics in the discovery message can be turned off by disabling statistics for caches, but that also makes cache metrics unavailable via JMX.
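To illustrate the third fix (one pass over local partitions instead of one pass per metric), here is a minimal Java sketch. It is not the actual Ignite code: Partition, EntriesStat and collect are hypothetical names made up for this example, and only a few of the 9 metrics are shown.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch only: none of these types are Ignite classes. */
class PartitionMetricsSketch {
    /** Hypothetical stand-in for a local partition. */
    static final class Partition {
        final boolean primary;
        final long entries;

        Partition(boolean primary, long entries) {
            this.primary = primary;
            this.entries = entries;
        }
    }

    /** Hypothetical holder for all values gathered in one pass. */
    static final class EntriesStat {
        long totalEntries;
        long primaryEntries;
        long backupEntries;
        int primaryPartitions;
        int backupPartitions;
    }

    /** Computes every counter in a single iteration over the partitions. */
    static EntriesStat collect(List<Partition> parts) {
        EntriesStat stat = new EntriesStat();

        for (Partition p : parts) {
            stat.totalEntries += p.entries;

            if (p.primary) {
                stat.primaryEntries += p.entries;
                stat.primaryPartitions++;
            }
            else {
                stat.backupEntries += p.entries;
                stat.backupPartitions++;
            }
        }

        return stat;
    }

    public static void main(String[] args) {
        EntriesStat stat = collect(Arrays.asList(
            new Partition(true, 10), new Partition(false, 20), new Partition(true, 30)));

        System.out.println("total=" + stat.totalEntries + ", primary=" + stat.primaryEntries
            + ", backup=" + stat.backupEntries);
    }
}
```

With this shape, a snapshot constructor would call collect once and copy the fields, rather than calling several getters that each walk all partitions.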
For the fourth fix, a new system property was added to disable the cache metrics update in the discovery message without disabling statistics.

Some benchmarks:

Environment: 2 nodes, 200 caches (with statistics enabled), 1024 partitions per cache, 10000 job metrics snapshots.

* First optimization (direct call to getOffHeapAllocatedSize)
Subject: DiscoveryMetricsProvider.metrics() method
Before optimization: 17 operations per second
After optimization: 8000 operations per second

* Second optimization (ClusterMetricsImpl instead of ClusterMetricsSnapshot)
Subject: DiscoveryMetricsProvider.metrics().getTotalCpus()
Before optimization: 8000 operations per second
After optimization: 2000000 operations per second
But an individual call to getTotalCpus() is relatively rare. In most cases DiscoveryMetricsProvider.metrics() is used for sending TcpDiscoveryMetricsUpdateMessage, and the performance of ClusterMetricsSnapshot.serialize(DiscoveryMetricsProvider.metrics()) stayed the same (8000 operations per second). Perhaps after the first optimization this optimization is no longer needed?

* Third optimization (one iteration over partitions)
Subject: DiscoveryMetricsProvider.cacheMetrics()
Before optimization: 17 operations per second
After optimization: 75 operations per second

> Cache metrics are updated in timeout-worker potentially delaying critical code execution due to current implementation issues.
> ------------------------------------------------------------------------------------------------------------------------------
>
> Key: IGNITE-6923
> URL: https://issues.apache.org/jira/browse/IGNITE-6923
> Project: Ignite
> Issue Type: Improvement
> Affects Versions: 2.3
> Reporter: Alexei Scherbakov
> Assignee: Aleksey Plekhanov
> Labels: iep-6
> Fix For: 2.4
>
>
> Some metrics are using cache iteration for calculation. If the number of caches is rather large, this can be slow.
> Similar code is running in the discovery thread.
> See stack trace for example.
> {noformat}
> "grid-timeout-worker-#39%DPL_GRID%DplGridNodeName%" #152 prio=5 os_prio=0 tid=0x00007f1009a03000 nid=0x5caa runnable [0x00007f0f059d9000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.HashMap.containsKey(HashMap.java:595)
>         at java.util.HashSet.contains(HashSet.java:203)
>         at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032)
>         at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$3.apply(IgniteCacheOffheapManagerImpl.java:339)
>         at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$3.apply(IgniteCacheOffheapManagerImpl.java:337)
>         at org.apache.ignite.internal.util.lang.gridfunc.TransformFilteringIterator.hasNext(TransformFilteringIterator.java:90)
>         at org.apache.ignite.internal.util.lang.GridIteratorAdapter.hasNext(GridIteratorAdapter.java:45)
>         at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.cacheEntriesCount(IgniteCacheOffheapManagerImpl.java:293)
>         at org.apache.ignite.internal.processors.cache.CacheMetricsImpl.getOffHeapPrimaryEntriesCount(CacheMetricsImpl.java:240)
>         at org.apache.ignite.internal.processors.cache.CacheMetricsSnapshot.<init>(CacheMetricsSnapshot.java:271)
>         at org.apache.ignite.internal.processors.cache.GridCacheAdapter.localMetrics(GridCacheAdapter.java:3217)
>         at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$7.cacheMetrics(GridDiscoveryManager.java:1151)
>         at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$7.nonHeapMemoryUsed(GridDiscoveryManager.java:1121)
>         at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$7.metrics(GridDiscoveryManager.java:1087)
>         at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNode.metrics(TcpDiscoveryNode.java:269)
>         at org.apache.ignite.internal.IgniteKernal$3.run(IgniteKernal.java:1175)
>         at org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$CancelableTask.onTimeout(GridTimeoutProcessor.java:256)
>         - locked <0x00007f115f5bf890> (a org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$CancelableTask)
>         at org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:158)
>         at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)