[
https://issues.apache.org/jira/browse/IGNITE-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304502#comment-16304502
]
Aleksey Plekhanov edited comment on IGNITE-6923 at 12/27/17 12:33 PM:
----------------------------------------------------------------------
Four fixes were made:
# Each cluster metrics calculation invoked the
{DiscoveryMetricsProvider#cacheMetrics} method, which creates metrics snapshots
for all caches and is heavy when a large number of caches are started. Only one
cache metric ({getOffHeapAllocatedSize}) is actually needed for the cluster
metrics calculation, so the {cacheMetrics} invocation was replaced with a
direct call to {getOffHeapAllocatedSize}.
# Each invocation of {DiscoveryMetricsProvider#metrics} created a cluster
metrics snapshot. That is fine for the discovery update message (all metrics
are serialized, so all of them must be calculated), but sometimes it is
redundant. For example, a call to {locNode.metrics().getTotalCpus()} calculated
every cluster metric but used only the total CPU count, which can be computed
at a much lower cost. To solve this, the {DiscoveryMetricsProvider#metrics}
method no longer calculates a snapshot of all metrics; it returns a
{ClusterMetricsImpl} instance, which calculates each metric on demand (the
first sketch after this list illustrates the pattern).
# Some cache metrics (entries count, partitions count) iterate through the
local partitions to get their result. There are 9 such methods, and each of
them performed this iteration on every call, although all 9 metrics can be
computed in a single iteration over the partitions. A new method was
implemented to calculate all these metrics at once, and in
{CacheMetricsSnapshot} the individual calculations were replaced by calls to it
(see the second sketch after this list).
# If there are many nodes and many caches in the cluster, the size of
{TcpDiscoveryMetricsUpdateMessage} can be rather large (up to hundreds of
megabytes), because it carries every metric of every cache on every node.
Sending cache metrics in the discovery message could already be turned off by
disabling statistics for the caches, but that also made the cache metrics
unavailable via JMX. A new system property was added to disable the cache
metrics update in the discovery message without disabling statistics (see the
third sketch after this list).
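A minimal, self-contained sketch of the idea behind the first two fixes: an
eager snapshot pays for every metric up front, while an on-demand
implementation pays only for what the caller reads. The {Metrics} interface and
both classes below are simplified stand-ins for illustration, not the actual
Ignite types:
{code:java}
import java.lang.management.ManagementFactory;

public class OnDemandMetricsDemo {
    /** Simplified stand-in for the cluster metrics contract. */
    interface Metrics {
        int getTotalCpus();
        long getNonHeapMemoryUsed();
    }

    /** Snapshot-style: every value is computed up front, even if never read. */
    static class EagerSnapshot implements Metrics {
        private final int cpus = Runtime.getRuntime().availableProcessors();
        private final long nonHeapUsed =
            ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage().getUsed();

        @Override public int getTotalCpus() { return cpus; }
        @Override public long getNonHeapMemoryUsed() { return nonHeapUsed; }
    }

    /** On-demand: each getter computes only what the caller actually asks for. */
    static class OnDemandMetrics implements Metrics {
        @Override public int getTotalCpus() {
            return Runtime.getRuntime().availableProcessors();
        }

        @Override public long getNonHeapMemoryUsed() {
            return ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage().getUsed();
        }
    }

    public static void main(String[] args) {
        // A caller that needs only the CPU count pays only for the CPU count.
        Metrics m = new OnDemandMetrics();

        System.out.println("CPUs: " + m.getTotalCpus());
    }
}
{code}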
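A hedged sketch of the third fix: folding several per-partition counters into a
single pass. The {Partition} and {EntriesCountMetrics} types below are
illustrative only, not the Ignite API:
{code:java}
import java.util.Arrays;
import java.util.List;

public class OnePassPartitionMetrics {
    /** Illustrative local-partition model. */
    static class Partition {
        final boolean primary;
        final long entriesCnt;

        Partition(boolean primary, long entriesCnt) {
            this.primary = primary;
            this.entriesCnt = entriesCnt;
        }
    }

    /** All partition-derived counters, filled by a single pass. */
    static class EntriesCountMetrics {
        long primaryEntriesCnt;
        long backupEntriesCnt;
        int primaryPartsCnt;
        int backupPartsCnt;
    }

    /** One iteration over local partitions instead of one iteration per metric. */
    static EntriesCountMetrics calculate(List<Partition> locParts) {
        EntriesCountMetrics m = new EntriesCountMetrics();

        for (Partition p : locParts) {
            if (p.primary) {
                m.primaryPartsCnt++;
                m.primaryEntriesCnt += p.entriesCnt;
            }
            else {
                m.backupPartsCnt++;
                m.backupEntriesCnt += p.entriesCnt;
            }
        }

        return m;
    }

    public static void main(String[] args) {
        List<Partition> parts = Arrays.asList(
            new Partition(true, 100), new Partition(false, 40), new Partition(true, 60));

        EntriesCountMetrics m = calculate(parts);

        System.out.println(m.primaryPartsCnt + " primary partitions, "
            + m.primaryEntriesCnt + " primary entries");
    }
}
{code}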
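The fourth fix can be pictured as a simple gate in the metrics provider. The
property name in this sketch is a placeholder (the comment above does not spell
out the property that was actually added, so check the commit for the real
name):
{code:java}
import java.util.Collections;
import java.util.Map;

public class CacheMetricsGateDemo {
    /** Placeholder name; the actual property added by the fix may differ. */
    static final String DISABLE_CACHE_METRICS_PROP = "EXAMPLE_DISABLE_CACHE_METRICS_UPDATE";

    /** Returns per-cache metrics, or an empty map when the update is disabled. */
    static Map<Integer, String> cacheMetrics(Map<Integer, String> collected) {
        // With an empty map, no cache metrics get into the discovery message,
        // while cache statistics (and therefore the JMX metrics) stay enabled.
        if (Boolean.getBoolean(DISABLE_CACHE_METRICS_PROP))
            return Collections.emptyMap();

        return collected;
    }

    public static void main(String[] args) {
        System.setProperty(DISABLE_CACHE_METRICS_PROP, "true");

        Map<Integer, String> metrics = cacheMetrics(Collections.singletonMap(1, "cache-a"));

        System.out.println("Metrics sent: " + metrics); // Prints an empty map.
    }
}
{code}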
Some benchmarks:
Environment: 2 nodes, 200 caches (with statistics enabled), 1024 partitions per
cache, 10000 job metrics snapshots.
* First optimization (direct call to {getOffHeapAllocatedSize})
Subject: {DiscoveryMetricsProvider.metrics()} method
Before optimization: 17 operations per second
After optimization: 8000 operations per second
* Second optimization ({ClusterMetricsImpl} instead of {ClusterMetricsSnapshot})
Subject: {DiscoveryMetricsProvider.metrics().getTotalCpus()}
Before optimization: 8000 operations per second
After optimization: 2000000 operations per second
However, an individual call to {getTotalCpus()} is relatively rare; in most
cases {DiscoveryMetricsProvider.metrics()} is used for sending
{TcpDiscoveryMetricsUpdateMessage}, and the performance of
{ClusterMetricsSnapshot.serialize(DiscoveryMetricsProvider.metrics())} stayed
the same (8000 operations per second). Perhaps after the first optimization
this (second) optimization is no longer needed?
* Third optimization (one iteration over partitions)
Subject: {DiscoveryMetricsProvider.cacheMetrics()}
Before optimization: 17 operations per second
After optimization: 75 operations per second
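For reference, the "operations per second" figures above can be reproduced in
spirit with a hand-rolled probe like the one below; the measured call is just a
stand-in, and a real comparison should use a proper harness such as JMH:
{code:java}
public class MetricsThroughputProbe {
    public static void main(String[] args) {
        final long durationNanos = 3_000_000_000L; // Measure for ~3 seconds.

        long deadline = System.nanoTime() + durationNanos;
        long ops = 0;
        long sink = 0; // Consume the result so the JIT cannot drop the call.

        while (System.nanoTime() < deadline) {
            // Stand-in for the measured call, e.g. provider.metrics().getTotalCpus().
            sink += Runtime.getRuntime().availableProcessors();
            ops++;
        }

        System.out.println(ops / (durationNanos / 1_000_000_000L)
            + " operations per second (sink=" + sink + ")");
    }
}
{code}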
> Cache metrics are updated in timeout-worker potentially delaying critical
> code execution due to current implementation issues.
> ------------------------------------------------------------------------------------------------------------------------------
>
> Key: IGNITE-6923
> URL: https://issues.apache.org/jira/browse/IGNITE-6923
> Project: Ignite
> Issue Type: Improvement
> Affects Versions: 2.3
> Reporter: Alexei Scherbakov
> Assignee: Aleksey Plekhanov
> Labels: iep-6
> Fix For: 2.4
>
>
> Some metrics use cache iteration for calculation. If the number of caches is
> rather large, this can be slow.
> Similar code runs in the discovery thread.
> See the stack trace below for an example.
> {noformat}
> "grid-timeout-worker-#39%DPL_GRID%DplGridNodeName%" #152 prio=5 os_prio=0
> tid=0x00007f1009a03000 nid=0x5caa runnable [0x00007f0f059d9000]
> java.lang.Thread.State: RUNNABLE
> at java.util.HashMap.containsKey(HashMap.java:595)
> at java.util.HashSet.contains(HashSet.java:203)
> at
> java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032)
> at
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$3.apply(IgniteCacheOffheapManagerImpl.java:339)
> at
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$3.apply(IgniteCacheOffheapManagerImpl.java:337)
> at
> org.apache.ignite.internal.util.lang.gridfunc.TransformFilteringIterator.hasNext:@TransformFilteringIterator.java:90)
> at
> org.apache.ignite.internal.util.lang.GridIteratorAdapter.hasNext(GridIteratorAdapter.java:45)
>
> at
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.cacheEntriesCount(IgniteCacheOffheapManagerImpl.java:293)
> at
> org.apache.ignite.internal.processors.cache.CacheMetricsImpl.getOffHeapPrimaryEntriesCount(CacheMetricsImpl.java:240)
> at
> org.apache.ignite.internal.processors.cache.CacheMetricsSnapshot.<init>(CacheMetricsSnapshot.java:271)
>
> at
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.localMetrics(GridCacheAdapter.java:3217)
>
> at
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$7.cacheMetrics(GridDiscoveryManager.java:1151)
> at
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$7.nonHeapMemoryUsed(GridDiscoveryManager.java:1121)
> at
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$7.metrics(GridDiscoveryManager.java:1087)
>
> at
> org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNode.metrics(TcpDiscoveryNode.java:269)
>
> at
> org.apache.ignite.internal.IgniteKernal$3.run(IgniteKernal.java:1175)
> at
> org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$CancelableTask.onTimeout(GridTimeoutProcessor.java:256)
> - locked <0x00007f115f5bf890> (a
> org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$CancelableTask)
> at
> org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:158)
> at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)