This is an automated email from the ASF dual-hosted git repository.

wusheng pushed a commit to branch id-read-optimization
in repository https://gitbox.apache.org/repos/asf/skywalking.git
commit 6fff7f3cff88032eef5648b3a77c578b47286678
Author: Wu Sheng <[email protected]>
AuthorDate: Tue Jun 29 10:03:30 2021 +0800

    Optimize IDs reading in the persistent worker.
---
 CHANGES.md                                         | 38 ++++++++++++++--------
 .../analysis/worker/MetricsPersistentWorker.java   | 28 ++++++++++++++--
 2 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/CHANGES.md b/CHANGES.md
index 1fbb79b..c4a5cdd 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -4,39 +4,45 @@ Release Notes.

 8.7.0
 ------------------
+
 #### Project
+
 * Extract dependency management to a bom.
 * Add JDK 16 to test matrix.

 #### Java Agent
+
 * Supports modifying span attributes in async mode.
 * Agent supports the collection of JVM arguments and jar dependency information.
-* [Temporary] Support authentication for log report channel. This feature and grpc channel is going to be removed after Satellite 0.2.0 release.
-* Remove deprecated gRPC method, `io.grpc.ManagedChannelBuilder#nameResolverFactory`. See [gRPC-java 7133](https://github.com/grpc/grpc-java/issues/7133) for more details.
+* [Temporary] Support authentication for log report channel. This feature and grpc channel is going to be removed after
+  Satellite 0.2.0 release.
+* Remove deprecated gRPC method, `io.grpc.ManagedChannelBuilder#nameResolverFactory`.
+  See [gRPC-java 7133](https://github.com/grpc/grpc-java/issues/7133) for more details.
 * Add `Neo4j-4.x` plugin.
 * Correct `profile.duration` to `profile.max_duration` in the default `agent.config` file.
-* Fix the reponse time of gRPC.
+* Fix the response time of gRPC.

 #### OAP-Backend
+
 * Disable Spring sleuth meter analyzer by default.
 * Only count 5xx as error in Envoy ALS receiver.
 * Upgrade apollo core caused by CVE-2020-15170.
 * Upgrade kubernetes client caused by CVE-2020-28052.
 * Upgrade Elasticsearch 7 client caused by CVE-2020-7014.
-* Upgrade jackson related libs caused by CVE-2018-11307, CVE-2018-14718 ~ CVE-2018-14721, CVE-2018-19360 ~ CVE-2018-19362,
-  CVE-2019-14379, CVE-2019-14540, CVE-2019-14892, CVE-2019-14893, CVE-2019-16335, CVE-2019-16942, CVE-2019-16943,
-  CVE-2019-17267, CVE-2019-17531, CVE-2019-20330, CVE-2020-8840, CVE-2020-9546, CVE-2020-9547, CVE-2020-9548,
-  CVE-2018-12022, CVE-2018-12023, CVE-2019-12086, CVE-2019-14439, CVE-2020-10672, CVE-2020-10673, CVE-2020-10968,
-  CVE-2020-10969, CVE-2020-11111, CVE-2020-11112, CVE-2020-11113, CVE-2020-11619, CVE-2020-11620, CVE-2020-14060,
-  CVE-2020-14061, CVE-2020-14062, CVE-2020-14195, CVE-2020-24616, CVE-2020-24750, CVE-2020-25649, CVE-2020-35490,
-  CVE-2020-35491, CVE-2020-35728 and CVE-2020-36179 ~ CVE-2020-36190.
+* Upgrade jackson related libs caused by CVE-2018-11307, CVE-2018-14718 ~ CVE-2018-14721, CVE-2018-19360 ~
+  CVE-2018-19362, CVE-2019-14379, CVE-2019-14540, CVE-2019-14892, CVE-2019-14893, CVE-2019-16335, CVE-2019-16942,
+  CVE-2019-16943, CVE-2019-17267, CVE-2019-17531, CVE-2019-20330, CVE-2020-8840, CVE-2020-9546, CVE-2020-9547,
+  CVE-2020-9548, CVE-2018-12022, CVE-2018-12023, CVE-2019-12086, CVE-2019-14439, CVE-2020-10672, CVE-2020-10673,
+  CVE-2020-10968, CVE-2020-10969, CVE-2020-11111, CVE-2020-11112, CVE-2020-11113, CVE-2020-11619, CVE-2020-11620,
+  CVE-2020-14060, CVE-2020-14061, CVE-2020-14062, CVE-2020-14195, CVE-2020-24616, CVE-2020-24750, CVE-2020-25649,
+  CVE-2020-35490, CVE-2020-35491, CVE-2020-35728 and CVE-2020-36179 ~ CVE-2020-36190.
 * Exclude log4j 1.x caused by CVE-2019-17571.
 * Upgrade log4j 2.x caused by CVE-2020-9488.
 * Upgrade nacos libs caused by CVE-2021-29441 and CVE-2021-29442.
-* Upgrade netty caused by CVE-2019-20444, CVE-2019-20445, CVE-2019-16869, CVE-2020-11612, CVE-2021-21290, CVE-2021-21295
-  and CVE-2021-21409.
+* Upgrade netty caused by CVE-2019-20444, CVE-2019-20445, CVE-2019-16869, CVE-2020-11612, CVE-2021-21290, CVE-2021-21295
+  and CVE-2021-21409.
 * Upgrade consul client caused by CVE-2018-1000844, CVE-2018-1000850.
-* Upgrade zookeeper caused by CVE-2019-0201.
+* Upgrade zookeeper caused by CVE-2019-0201.
 * Upgrade snake yaml caused by CVE-2017-18640.
 * Upgrade embed tomcat caused by CVE-2020-13935.
 * Upgrade commons-lang3 to avoid potential NPE in some JDK versions.
@@ -45,8 +51,13 @@ Release Notes.
 * Fix CounterWindow increase computing issue.
 * Performance: optimize Envoy ALS analyzer performance in high traffic load scenario (reduce ~1cpu in ~10k RPS).
 * Performance: trim useless metadata fields in Envoy ALS metadata to improve performance.
+* Performance: enhance persistent session mechanism, by removing cache reloading for minute-level metrics. Reduce 30%
+  ElasticSearch ID-read traffic, tradeoff by tolerating metrics inaccurate when the cluster scales out and down.
+* Performance: enhance persistent session mechanism, about differentiating cache timeout for different dimensionality
+  metrics. The timeout of the cache for minute and hour level metrics has been prolonged to ~5 min.

 #### UI
+
 * Fix the date component for log conditions.
 * Fix selector keys for duplicate options.
 * Add Python celery plugin.
@@ -55,7 +66,6 @@ Release Notes.

 #### Documentation

-
 All issues and pull requests are [here](https://github.com/apache/skywalking/milestone/90?closed=1)

 ------------------
diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java
index 5195ac6..2ccdf12 100644
--- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java
+++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java
@@ -51,6 +51,11 @@ import org.apache.skywalking.oap.server.telemetry.api.MetricsTag;
  */
 @Slf4j
 public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
+    /**
+     * The counter of MetricsPersistentWorker instance, to calculate session timeout offset.
+     */
+    private static long sessionTimeoutOffsetCounter = 0;
+
     private final Model model;
     private final Map<Metrics, Metrics> context;
     private final IMetricsDAO metricsDAO;
@@ -60,7 +65,9 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
     private final Optional<MetricsTransWorker> transWorker;
     private final boolean enableDatabaseSession;
     private final boolean supportUpdate;
+    private boolean isDownSampling;
     private CounterMetrics aggregationCounter;
+    private long sessionTimeout = 70_000; // Unit, ms. 70,000ms means more than one minute.

     MetricsPersistentWorker(ModuleDefineHolder moduleDefineHolder, Model model, IMetricsDAO metricsDAO,
                             AbstractWorker<Metrics> nextAlarmWorker, AbstractWorker<ExportEvent> nextExportWorker,
@@ -74,6 +81,7 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
         this.nextExportWorker = Optional.ofNullable(nextExportWorker);
         this.transWorker = Optional.ofNullable(transWorker);
         this.supportUpdate = supportUpdate;
+        this.isDownSampling = false;

         String name = "METRICS_L2_AGGREGATION";
         int size = BulkConsumePool.Creator.recommendMaxSize() / 8;
@@ -98,10 +106,11 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
             new MetricsTag.Keys("metricName", "level", "dimensionality"),
             new MetricsTag.Values(model.getName(), "2", model.getDownsampling().getName())
         );
+        sessionTimeoutOffsetCounter++;
     }

     /**
-     * Create the leaf MetricsPersistentWorker, no next step.
+     * Create the leaf and down-sampling MetricsPersistentWorker, no next step.
      */
     MetricsPersistentWorker(ModuleDefineHolder moduleDefineHolder, Model model, IMetricsDAO metricsDAO,
                             boolean enableDatabaseSession, boolean supportUpdate) {
@@ -109,6 +118,11 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
              null, null, null,
              enableDatabaseSession, supportUpdate
         );
+        this.isDownSampling = true;
+        // For a down-sampling metrics, we prolong the session timeout for 4 times, nearly 5 minutes.
+        // And add offset according to worker creation sequence, to avoid context clear overlap,
+        // eventually optimize load of IDs reading.
+        this.sessionTimeout = sessionTimeout * 4 + sessionTimeoutOffsetCounter * 200;
     }

     /**
@@ -217,6 +231,14 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
             return;
         }

+        // If session is activated and this worker is about minute level metrics,
+        // we could skip `#multiGet` and trust cache,
+        // because in worst case, we override one time bucket metrics due to dirty-write in the cluster re-balancing case.
+        // In down-sampling cases(hour/day), the cache would be clear periodically to keep memory safe,
+        // then have to reload(multiGet) metrics from database.
+        if (enableDatabaseSession && !isDownSampling) {
+            return;
+        }
         final List<Metrics> dbMetrics = metricsDAO.multiGet(model, noInCacheMetrics);
         if (!enableDatabaseSession) {
             // Clear the cache only after results from DB are returned successfully.
@@ -235,8 +257,8 @@ public class MetricsPersistentWorker extends PersistenceWorker<Metrics> {
         while (iterator.hasNext()) {
             Metrics metrics = iterator.next();
             metrics.extendSurvivalTime(tookTime);
-            // 70,000ms means more than one minute.
-            if (metrics.getSurvivalTime() > 70000) {
+
+            if (metrics.getSurvivalTime() > sessionTimeout) {
                 iterator.remove();
             }
         }
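
The timeout arithmetic and the eviction loop touched by this commit can be checked in isolation. The following is a minimal standalone sketch, not the SkyWalking code: `SessionTimeoutSketch`, `CachedEntry`, `downSamplingTimeout` and `evictExpired` are hypothetical names introduced only for illustration, and only the constants (70,000 ms base, factor 4, 200 ms per worker) and the comparison against `sessionTimeout` are taken from the diff above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the timeout logic in MetricsPersistentWorker.
class SessionTimeoutSketch {
    // Base session timeout from the diff: 70,000 ms, slightly more than one minute.
    static final long BASE_TIMEOUT_MS = 70_000;
    // Per-worker stagger from the diff: 200 ms per previously counted worker.
    static final long OFFSET_STEP_MS = 200;

    // Mirrors `sessionTimeout * 4 + sessionTimeoutOffsetCounter * 200`.
    // `offsetCounter` is the counter value observed when the worker is created.
    static long downSamplingTimeout(long offsetCounter) {
        return BASE_TIMEOUT_MS * 4 + offsetCounter * OFFSET_STEP_MS;
    }

    // Hypothetical cache entry that only tracks how long it has survived.
    static final class CachedEntry {
        long survivalTime;

        CachedEntry(long survivalTime) {
            this.survivalTime = survivalTime;
        }
    }

    // Ages every entry by `tookTime` and removes those older than `sessionTimeout`,
    // following the shape of the while-loop in the last hunk. Returns the eviction count.
    static int evictExpired(List<CachedEntry> cache, long tookTime, long sessionTimeout) {
        int removed = 0;
        Iterator<CachedEntry> iterator = cache.iterator();
        while (iterator.hasNext()) {
            CachedEntry entry = iterator.next();
            entry.survivalTime += tookTime;
            if (entry.survivalTime > sessionTimeout) {
                iterator.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        // 70,000 * 4 = 280,000 ms, i.e. 4 min 40 s before any offset is added.
        System.out.println(downSamplingTimeout(0));   // 280000
        System.out.println(downSamplingTimeout(10));  // 282000

        List<CachedEntry> cache = new ArrayList<>();
        cache.add(new CachedEntry(69_000)); // becomes 71,000 ms, above the 70,000 ms limit
        cache.add(new CachedEntry(10_000)); // becomes 12,000 ms, kept
        System.out.println(evictExpired(cache, 2_000, BASE_TIMEOUT_MS)); // 1
    }
}
```

Because each worker gets a slightly different timeout, their caches should expire at different moments instead of all at once, which appears to be what the commit comment means by avoiding "context clear overlap".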
