[
https://issues.apache.org/jira/browse/SOLR-13234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man reopened SOLR-13234:
-----------------------------
Shalin: there have been several Jenkins failures from
SolrExporterIntegrationTest.jvmMetrics in the few days since you committed
this, including this seed, which reproduces reliably for me on master...
{noformat}
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=SolrExporterIntegrationTest -Dtests.method=jvmMetrics -Dtests.seed=D0408796D2DB58EC -Dtests.multiplier=3 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=tr-TR -Dtests.timezone=America/Argentina/Mendoza -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
[junit4] FAILURE 3.92s | SolrExporterIntegrationTest.jvmMetrics <<<
[junit4] > Throwable #1: java.lang.AssertionError: expected:<4> but was:<0>
[junit4] > at __randomizedtesting.SeedInfo.seed([D0408796D2DB58EC:3F4BA79C7359478]:0)
[junit4] > at org.apache.solr.prometheus.exporter.SolrExporterIntegrationTest.jvmMetrics(SolrExporterIntegrationTest.java:68)
[junit4] > at java.lang.Thread.run(Thread.java:748)
{noformat}
and this seed, which was reported by 8.x Jenkins but also reproduces on master...
{noformat}
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=SolrExporterIntegrationTest -Dtests.method=jvmMetrics -Dtests.seed=62880F3B9F140C89 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.badapples=true -Dtests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-8.x/test-data/enwiki.random.lines.txt -Dtests.locale=tr-TR -Dtests.timezone=Asia/Novokuznetsk -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
[junit4] FAILURE 3.40s | SolrExporterIntegrationTest.jvmMetrics <<<
[junit4] > Throwable #1: java.lang.AssertionError: expected:<4> but was:<0>
[junit4] > at __randomizedtesting.SeedInfo.seed([62880F3B9F140C89:B13C32D48AFAC01D]:0)
[junit4] > at org.apache.solr.prometheus.exporter.SolrExporterIntegrationTest.jvmMetrics(SolrExporterIntegrationTest.java:68)
[junit4] > at java.lang.Thread.run(Thread.java:748)
{noformat}
> Prometheus Metric Exporter Not Threadsafe
> -----------------------------------------
>
> Key: SOLR-13234
> URL: https://issues.apache.org/jira/browse/SOLR-13234
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: metrics
> Affects Versions: 7.6, 8.0
> Reporter: Danyal Prout
> Assignee: Shalin Shekhar Mangar
> Priority: Minor
> Labels: metric-collector
> Fix For: 8.x, master (9.0)
>
> Attachments: SOLR-13234-branch_7x.patch
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The Solr Prometheus Exporter collects metrics when it receives an HTTP request
> from Prometheus. Prometheus sends this request on its [scrape
> interval|https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config].
> When the time taken to collect the Solr metrics is greater than the scrape
> interval of the Prometheus server, this results in concurrent metric
> collection occurring in this
> [method|https://github.com/apache/lucene-solr/blob/master/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/collector/SolrCollector.java#L86].
> This method doesn’t appear to be thread-safe; for instance, you could have
> concurrent modifications of a
> [map|https://github.com/apache/lucene-solr/blob/master/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/collector/SolrCollector.java#L119].
> After a while the behavior of the Solr Exporter process becomes
> nondeterministic; we've observed NPEs and loss of metrics.
> To address this, I'm proposing the following fixes:
> 1. Read/parse the configuration at startup and make it immutable.
> 2. Collect metrics from Solr on an interval which is controlled by the Solr
> Exporter and cache the metric samples to return during Prometheus scraping.
> Metric collection can be expensive (for example, executing arbitrary Solr
> searches), so it's not ideal to allow concurrent metric collection, or
> collection on an interval that is not defined by the Solr Exporter.
> There are also a few other performance improvements that we've made while
> fixing this, for example using the ClusterStateProvider instead of sending
> multiple HTTP requests to each Solr node to look up all the cores.
> I'm currently finishing up these changes which I'll submit as a PR.
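
The second fix proposed in the description above (collecting on an interval the exporter controls, and serving cached samples to scrapes) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: the class name CachedMetricsCollector, its methods, and the use of plain strings as "samples" are all hypothetical stand-ins for the real SolrCollector and Prometheus sample types.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical sketch; names and types are assumptions, not the exporter's API. */
class CachedMetricsCollector implements AutoCloseable {
    // Latest complete snapshot. It is swapped atomically, so a scrape never
    // observes a half-built collection.
    private final AtomicReference<List<String>> cache =
        new AtomicReference<>(Collections.<String>emptyList());

    // The (possibly expensive) collection routine, e.g. queries against Solr.
    private final Supplier<List<String>> source;

    // A single scheduler thread means scheduled collections never overlap.
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    CachedMetricsCollector(Supplier<List<String>> source) {
        this.source = source;
    }

    /** Starts periodic collection on an interval chosen by the exporter. */
    void start(long intervalMillis) {
        scheduler.scheduleWithFixedDelay(
            this::refresh, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Runs one collection and publishes an immutable copy of the result. */
    void refresh() {
        try {
            List<String> fresh =
                Collections.unmodifiableList(new ArrayList<>(source.get()));
            cache.set(fresh);
        } catch (RuntimeException e) {
            // Keep serving the previous snapshot; an uncaught exception would
            // also cancel all future scheduled runs.
        }
    }

    /** Called on every scrape: cheap, and never triggers a collection. */
    List<String> collect() {
        return cache.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With this shape, a scrape that arrives while a collection is still running simply gets the previous snapshot, so a slow collection and a short scrape interval no longer lead to concurrent collection.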