[ 
https://issues.apache.org/jira/browse/SOLR-13234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775717#comment-16775717
 ] 

Shalin Shekhar Mangar commented on SOLR-13234:
----------------------------------------------

Thanks Danyal, great stuff!

A few comments:
# In SolrCloudScraper.pingAllCores and metricsForAllHosts methods, any 
exception thrown during the ping (inside the function passed to the 
sendRequestsInParallel method) will cause the http clients to leak. The call to 
closeAll should be in a try-finally clause.
# We should have some sane defaults for connection and read timeout on the 
HttpSolrClient that is created. Even better if we can set a value through the 
configuration.
# Related to the above, it'd be nice to create an HttpClient instance using 
values loaded from the configuration and use it across all the different 
CloudSolrClient and HttpSolrClient instances that are created. That way, the 
http connections can be re-used during ping and fetching metrics.

> Prometheus Metric Exporter Not Threadsafe
> -----------------------------------------
>
>                 Key: SOLR-13234
>                 URL: https://issues.apache.org/jira/browse/SOLR-13234
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: metrics
>    Affects Versions: 7.6, 8.0
>            Reporter: Danyal Prout
>            Assignee: Shalin Shekhar Mangar
>            Priority: Minor
>              Labels: metric-collector
>             Fix For: 8.x, master (9.0)
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Solr Prometheus Exporter collects metrics when it receives a HTTP request 
> from Prometheus. Prometheus sends this request, on its [scrape 
> interval|https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config].
>  When the time taken to collect the Solr metrics is greater than the scrape 
> interval of the Prometheus server, this results in concurrent metric 
> collection occurring in this 
> [method|https://github.com/apache/lucene-solr/blob/master/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/collector/SolrCollector.java#L86].
>  This method doesn’t appear to be thread safe, for instance you could have 
> concurrent modifications of a 
> [map|https://github.com/apache/lucene-solr/blob/master/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/collector/SolrCollector.java#L119].
>  After a while the Solr Exporter processes becomes nondeterministic, we've 
> observed NPE and loss of metrics.
> To address this, I'm proposing the following fixes:
> 1. Read/parse the configuration at startup and make it immutable. 
>  2. Collect metrics from Solr on an interval which is controlled by the Solr 
> Exporter and cache the metric samples to return during Prometheus scraping. 
> Metric collection can be expensive, for example executing arbitrary Solr 
> searches, it's not ideal to allow for concurrent metric collection and on an 
> interval which is not defined by the Solr Exporter.
> There are also a few other performance improvements that we've made while 
> fixing this, for example using the ClusterStateProvider instead of sending 
> multiple HTTP requests to each Solr node to lookup all the cores.
> I'm currently finishing up these changes which I'll submit as a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to