Copilot commented on code in PR #4210: URL: https://github.com/apache/solr/pull/4210#discussion_r2927056323
########## solr/monitoring/dev/docker-compose.yml: ########## @@ -0,0 +1,125 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# ============================================================================ +# DEVELOPER CONVENIENCE ONLY — NOT FOR PRODUCTION USE +# ============================================================================ +# +# This docker-compose stack is provided solely to make it easy for contributors +# to view and test changes to the two monitoring artifacts: +# +# ../grafana-solr-dashboard.json — Grafana dashboard (main artifact) +# ../prometheus-solr-alerts.yml — Prometheus alert rules (main artifact) +# +# It is NOT intended as a reference deployment or a recommended production setup. +# For integrating these artifacts into your own Prometheus + Grafana installation, +# see the Solr Reference Guide (monitoring-with-prometheus-and-grafana.adoc). 
+# +# Usage: +# cd solr/monitoring/dev +# docker-compose up -d +# +# Services: +# solr1 http://localhost:8983 (SolrCloud node 1, embedded ZooKeeper) +# solr2 http://localhost:8984 (SolrCloud node 2) +# Prometheus http://localhost:9090 +# Grafana http://localhost:3000 (admin / admin) +# Alertmanager http://localhost:9093 +# trafficgen (no port — continuously indexes/searches the 'test' collection) +# +# Prometheus may show scrape errors for the first 30-60 seconds while Solr starts. +# ============================================================================ + +services: + solr1: + image: solr:10 + ports: + - "8983:8983" + - "9983:9983" # embedded ZooKeeper (Solr 10 default: ZK_PORT = SOLR_PORT + 1000) + # No ZK_HOST set: Solr 10 defaults to SolrCloud with embedded ZooKeeper on port 9983. + healthcheck: + test: ["CMD-SHELL", "curl -sf http://localhost:8983/solr/admin/info/system || exit 1"] + interval: 10s + timeout: 5s + retries: 12 + start_period: 40s + + solr2: + image: solr:10 + ports: + - "8984:8983" + environment: + ZK_HOST: "solr1:9983" + depends_on: + solr1: + condition: service_healthy + healthcheck: + test: ["CMD-SHELL", "curl -sf http://localhost:8983/solr/admin/info/system || exit 1"] + interval: 10s + timeout: 5s + retries: 12 + start_period: 30s + + prometheus: + image: prom/prometheus:latest + ports: + - "9090:9090" + volumes: + - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro + - ../prometheus-solr-alerts.yml:/etc/prometheus/rules/solr-alerts.yml:ro + command: + - "--config.file=/etc/prometheus/prometheus.yml" + - "--storage.tsdb.retention.time=7d" + - "--web.enable-lifecycle" + depends_on: + solr1: + condition: service_healthy + solr2: + condition: service_healthy + restart: unless-stopped + + grafana: + image: grafana/grafana:latest + ports: + - "3000:3000" + environment: + GF_SECURITY_ADMIN_USER: admin + GF_SECURITY_ADMIN_PASSWORD: admin + GF_USERS_ALLOW_SIGN_UP: "false" + volumes: + - 
./grafana/provisioning:/etc/grafana/provisioning:ro + - ../grafana-solr-dashboard.json:/var/lib/grafana/dashboards/grafana-solr-dashboard.json:ro + depends_on: + - prometheus + restart: unless-stopped + + alertmanager: + image: prom/alertmanager:latest + ports: Review Comment: Even though this stack is marked “dev only”, using `:latest` for Prometheus/Grafana/Alertmanager makes local testing non-reproducible and can break unexpectedly when upstream images change. Consider pinning these images to specific versions (or at least documenting the versions the dashboard/alerts were validated against). ########## solr/monitoring/mixin/config.libsonnet: ########## @@ -0,0 +1,71 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +{ + // _config holds all tunable parameters for the Solr monitoring mixin. + // Override any value by creating a local config object that extends this one. 
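As a sketch of the override mechanism the comment above describes, a consumer would typically extend `_config` rather than edit this file. The import path and the `heapUsageThreshold` key below are taken as assumptions from other hunks in this review; note also that this pattern only takes effect if the alert/dashboard sources read config through the composed object, whereas `alerts.libsonnet` as shown later imports `config.libsonnet` directly:

```jsonnet
// my-solr-overrides.libsonnet — hypothetical consumer file
local solrMixin = import 'mixin.libsonnet';  // entry-point path is an assumption

solrMixin {
  _config+:: {
    // Override a single tunable; all other defaults are inherited
    // from the mixin's hidden _config object.
    heapUsageThreshold: 0.80,
  },
}
```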
+ _config:: { + + // ----------------------------------------------------------------------- + // Label selectors + // ----------------------------------------------------------------------- + + // Prometheus job selector used in alert expressions (set to match your Review Comment: `solrSelector` is documented here as being used in alert expressions, but the current alert rules in `alerts/alerts.libsonnet` don’t apply it (and several expressions use generic `jvm_*` metrics). Either update the alert rules to actually use `solrSelector` or adjust this comment to avoid misleading future maintainers. ```suggestion // Prometheus job selector for Solr metrics (set to match your ``` ########## solr/monitoring/README.md: ########## @@ -0,0 +1,249 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Solr 10.x Monitoring Artifacts + +This directory provides two ready-to-use monitoring artifacts for Solr 10.x that you can +drop into your existing Prometheus + Grafana installation: + +| File | Description | +|---|---| +| **`grafana-solr-dashboard.json`** | Grafana dashboard — import directly into Grafana | +| **`prometheus-solr-alerts.yml`** | Prometheus alert rules — reference from your `prometheus.yml` | + +These are the **main artifacts**. 
Everything else in this directory supports their creation +or testing. + +| File / Directory | Description | +|---|---| +| `otel-collector-solr.yml` | OTel Collector config snippet for the OTLP push path | +| `mixin/` | Jsonnet source (single source of truth used to regenerate the artifacts above) | +| `dev/` | **Developer convenience only** — docker-compose stack for testing changes to the artifacts | + +For the full integration guide see the Solr Reference Guide: +**xref:monitoring-with-prometheus-and-grafana.adoc** (deployment-guide module) Review Comment: This README uses an Antora-style `xref:` link, which won’t resolve when viewing the file on GitHub (or in the source release as plain Markdown). Consider replacing it with a normal relative/absolute URL to the rendered ref guide page (and/or to the adoc source path) so the link works in typical Markdown renderers. ```suggestion [Monitoring with Prometheus and Grafana](https://solr.apache.org/guide/monitoring-with-prometheus-and-grafana.html) (deployment-guide module) ``` ########## solr/solr-ref-guide/modules/deployment-guide/pages/monitoring-with-prometheus-and-grafana.adoc: ########## @@ -0,0 +1,181 @@ += Monitoring Solr with Prometheus and Grafana +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+// See the License for the specific language governing permissions and +// limitations under the License. + +Solr ships two ready-to-use monitoring artifacts in `solr/monitoring/` (source release only): + +* `grafana-solr-dashboard.json` — a Grafana dashboard covering the most important Solr operational metrics +* `prometheus-solr-alerts.yml` — a set of Prometheus alert rules for common failure conditions + +Import these into your own Prometheus + Grafana installation. +Solr exposes all metrics natively at the `/api/metrics` endpoint; no additional exporter process is required. +For the full list of available metrics see xref:metrics-reporting.adoc[]. + +.Solr 10.x monitoring architecture — Prometheus pull path and OTLP push path +image::monitoring-with-prometheus-and-grafana/solr-monitoring-diagram.png[Solr monitoring architecture diagram,800] + +== The Grafana Dashboard + +The "Solr 10.x Overview" dashboard provides five sections covering node health, JVM, +SolrCloud cluster state, index health, and cache efficiency. + +// TODO: Replace with a screenshot of the new Solr 10.x dashboard. +.Solr 10.x Overview dashboard (placeholder screenshot) +image::monitoring-with-prometheus-and-grafana/grafana-solr-dashboard.png[Solr Grafana dashboard screenshot,800] Review Comment: This refguide page still contains a TODO and explicitly labels the dashboard screenshot as a placeholder. If this page is intended to ship, please replace it with the real Solr 10.x dashboard screenshot (or remove the placeholder/TODO) so the published docs don't include unfinished content. ########## solr/monitoring/prometheus-solr-alerts.yml: ########## @@ -0,0 +1,133 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. 
+# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# This file is generated by mixin/alerts/alerts.libsonnet — do not edit directly. +# To regenerate: cd mixin && make alerts +groups: +- name: SolrAlerts + rules: + - alert: SolrHighHeapUsage + annotations: + description: | + Instance {{ $labels.instance }} is using {{ $value | humanizePercentage }} of its + maximum JVM heap (threshold: 90%). + High heap usage increases GC pressure and risks OutOfMemoryError. + Consider increasing -Xmx or reducing cache sizes in solrconfig.xml. + summary: Solr instance {{ $labels.instance }} has high JVM heap usage + expr: | + max by (instance) (jvm_memory_used_bytes{jvm_memory_type="heap"}) + / + max by (instance) (jvm_memory_limit_bytes{jvm_memory_type="heap"}) + > 0.90000000000000002 + for: 2m + labels: + severity: critical + - alert: SolrJvmGcThrashing + annotations: + description: | + Instance {{ $labels.instance }} is spending {{ $value | humanizeDuration }} per second + in garbage collection (threshold: 10 seconds/minute). + GC thrashing causes stop-the-world pauses and severely degrades search latency. + Check heap usage (SolrHighHeapUsage), consider tuning GC or increasing heap. 
+ summary: Solr instance {{ $labels.instance }} is experiencing GC thrashing + expr: | + sum by (instance) (rate(jvm_gc_duration_seconds_sum{instance=~".+"}[1m])) + > 10 + for: 3m + labels: + severity: critical + - alert: SolrLowDiskSpace + annotations: + description: | + Instance {{ $labels.instance }} has only {{ $value | humanizePercentage }} disk space + free (threshold: 15%). + Solr will stop accepting index updates when disk space is exhausted. + Free up space or expand the disk immediately. + summary: Solr instance {{ $labels.instance }} is low on disk space + expr: | + min by (instance) (solr_disk_space_megabytes{type="usable_space"}) + / + min by (instance) (solr_disk_space_megabytes{type="total_space"}) + < 0.14999999999999999 + for: 5m + labels: + severity: critical + - alert: SolrHighSearchLatency + annotations: + description: | + Instance {{ $labels.instance }} p99 search latency is {{ $value | humanizeDuration }}ms + (threshold: 1000ms). + Possible causes: large result sets, expensive faceting, insufficient cache, or GC pauses. + Check the Search Latency panel in the Grafana dashboard for trends. + summary: Solr instance {{ $labels.instance }} has high search latency + expr: | + histogram_quantile(0.99, + sum by (le, instance) ( + rate(solr_core_requests_times_milliseconds_bucket{handler=~"/select.*",internal="false"}[5m]) + ) + ) > 1000 + for: 5m + labels: + severity: warning + - alert: SolrHighErrorRate + annotations: + description: | + Instance {{ $labels.instance }} error rate is {{ $value | humanizePercentage }} + (threshold: 1%). + Check Solr logs for the root cause. Common causes: invalid queries, missing fields, + core loading failures, or network connectivity issues. 
+ summary: Solr instance {{ $labels.instance }} has a high error rate + expr: | + sum by (instance) (rate(solr_node_requests_total{category=~"(?i)error"}[5m])) + / + sum by (instance) (rate(solr_node_requests_total[5m])) + > 0.01 + for: 5m + labels: + severity: warning + - alert: SolrOverseerQueueBuildup + annotations: + description: | + The Overseer collection work queue has {{ $value }} pending operations on + instance {{ $labels.instance }} (threshold: 50). + A growing queue indicates the Overseer is falling behind; check for long-running + collection operations, Overseer leader election issues, or ZooKeeper latency. + summary: Solr Overseer collection work queue is building up on {{ $labels.instance }} + expr: | + sum by (instance) (solr_overseer_collection_work_queue_size) + > 50 + for: 5m + labels: + severity: warning + - alert: SolrHighMmapRatio + annotations: + description: | + The Solr index is using {{ $value | humanize }}% of available MMap address space + (physical RAM minus JVM heap), threshold: 85%. + When the index exceeds available MMap memory, Lucene falls back to I/O reads, + significantly degrading search performance. Consider adding RAM, reducing index size, + or increasing the JVM heap ratio. + summary: Solr index is using a high fraction of available MMap memory + expr: | + sum(solr_core_index_size_megabytes) + / + ( + (max(jvm_system_memory_total_bytes) - sum(jvm_memory_limit_bytes{jvm_memory_type="heap"})) Review Comment: The generated alert rule for `SolrHighMmapRatio` currently aggregates index size and memory across all instances (no per-`instance` grouping), which means it won’t fire/resolve per node and can hide a single-node hot spot. Since this file is generated, please fix the expression in the mixin source and regenerate so this YAML ends up computing the ratio per instance (and ideally scoping JVM metrics to Solr via `job="solr"` or similar). 
```suggestion sum by (instance) (solr_core_index_size_megabytes) / ( ( max by (instance) (jvm_system_memory_total_bytes{job="solr"}) - sum by (instance) (jvm_memory_limit_bytes{jvm_memory_type="heap", job="solr"}) ) ``` ########## solr/monitoring/mixin/alerts/alerts.libsonnet: ########## @@ -0,0 +1,225 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// alerts.libsonnet — Prometheus alert rules for Solr 10.x. +// +// Seven rules in group "SolrAlerts": +// Critical (3): SolrHighHeapUsage, SolrJvmGcThrashing, SolrLowDiskSpace +// Warning (4): SolrHighSearchLatency, SolrHighErrorRate, SolrOverseerQueueBuildup, SolrHighMmapRatio +// +// Thresholds are defined in config.libsonnet and can be overridden. +// All expressions include "by (instance)" so alerts fire and resolve per-node. + +local config = import '../config.libsonnet'; +local cfg = config._config; + +{ + groups: [ + { + name: 'SolrAlerts', + rules: [ + + // --------------------------------------------------------------- + // CRITICAL: SolrHighHeapUsage + // Fires when a single JVM instance uses > 90% of its max heap for 2 minutes. + // Uses max by (instance) to deduplicate the dual OTel JVM scopes. 
+ // --------------------------------------------------------------- + { + alert: 'SolrHighHeapUsage', + expr: ||| + max by (instance) (jvm_memory_used_bytes{jvm_memory_type="heap"}) + / + max by (instance) (jvm_memory_limit_bytes{jvm_memory_type="heap"}) + > %(threshold)s + ||| % { threshold: cfg.heapUsageThreshold }, + 'for': '2m', + labels: { severity: 'critical' }, + annotations: { + summary: 'Solr instance {{ $labels.instance }} has high JVM heap usage', + description: ||| + Instance {{ $labels.instance }} is using {{ $value | humanizePercentage }} of its + maximum JVM heap (threshold: %(threshold)s%%). + High heap usage increases GC pressure and risks OutOfMemoryError. + Consider increasing -Xmx or reducing cache sizes in solrconfig.xml. + ||| % { threshold: std.floor(cfg.heapUsageThreshold * 100) }, + }, + }, + + // --------------------------------------------------------------- + // CRITICAL: SolrJvmGcThrashing + // Fires when the sum of GC wall-clock time across all collectors exceeds + // cfg.gcThrashThresholdSecsPerMin seconds per minute for 3 minutes. + // --------------------------------------------------------------- + { + alert: 'SolrJvmGcThrashing', + expr: ||| + sum by (instance) (rate(jvm_gc_duration_seconds_sum{instance=~".+"}[1m])) + > %(threshold)s + ||| % { threshold: cfg.gcThrashThresholdSecsPerMin }, + 'for': '3m', + labels: { severity: 'critical' }, + annotations: { + summary: 'Solr instance {{ $labels.instance }} is experiencing GC thrashing', + description: ||| + Instance {{ $labels.instance }} is spending {{ $value | humanizeDuration }} per second + in garbage collection (threshold: %(threshold)s seconds/minute). + GC thrashing causes stop-the-world pauses and severely degrades search latency. + Check heap usage (SolrHighHeapUsage), consider tuning GC or increasing heap. 
+ ||| % { threshold: cfg.gcThrashThresholdSecsPerMin }, + }, + }, + + // --------------------------------------------------------------- + // CRITICAL: SolrLowDiskSpace + // Fires when free disk space drops below 15% of total for 5 minutes. + // Uses min/by(instance) so the alert fires per-node independently. + // --------------------------------------------------------------- + { + alert: 'SolrLowDiskSpace', + expr: ||| + min by (instance) (solr_disk_space_megabytes{type="usable_space"}) + / + min by (instance) (solr_disk_space_megabytes{type="total_space"}) + < %(threshold)s + ||| % { threshold: cfg.diskFreeThreshold }, + 'for': '5m', + labels: { severity: 'critical' }, + annotations: { + summary: 'Solr instance {{ $labels.instance }} is low on disk space', + description: ||| + Instance {{ $labels.instance }} has only {{ $value | humanizePercentage }} disk space + free (threshold: %(threshold)s%%). + Solr will stop accepting index updates when disk space is exhausted. + Free up space or expand the disk immediately. + ||| % { threshold: std.floor(cfg.diskFreeThreshold * 100) }, + }, + }, + + // --------------------------------------------------------------- + // WARNING: SolrHighSearchLatency + // Fires when p99 search latency exceeds 1000ms for /select handlers for 5 minutes. + // --------------------------------------------------------------- + { + alert: 'SolrHighSearchLatency', + expr: ||| + histogram_quantile(0.99, + sum by (le, instance) ( + rate(solr_core_requests_times_milliseconds_bucket{handler=~"/select.*",internal="false"}[5m]) + ) + ) > %(threshold)s + ||| % { threshold: cfg.searchLatencyThresholdMs }, + 'for': '5m', + labels: { severity: 'warning' }, + annotations: { + summary: 'Solr instance {{ $labels.instance }} has high search latency', + description: ||| + Instance {{ $labels.instance }} p99 search latency is {{ $value | humanizeDuration }}ms + (threshold: %(threshold)sms). 
+ Possible causes: large result sets, expensive faceting, insufficient cache, or GC pauses. + Check the Search Latency panel in the Grafana dashboard for trends. + ||| % { threshold: cfg.searchLatencyThresholdMs }, + }, + }, + + // --------------------------------------------------------------- + // WARNING: SolrHighErrorRate + // Fires when error requests exceed 1% of total requests for 5 minutes. + // + // NOTE: Verify the category label value for errors on your Solr 10.x instance. + // The regex "(?i)error" is a best-effort match. If solr_node_requests_total + // does not carry an error category, consider a handler-based proxy metric or + // check solr_core_requests_total with status=error (if available). + // --------------------------------------------------------------- + { + alert: 'SolrHighErrorRate', + expr: ||| + sum by (instance) (rate(solr_node_requests_total{category=~"(?i)error"}[5m])) + / + sum by (instance) (rate(solr_node_requests_total[5m])) + > %(threshold)s + ||| % { threshold: cfg.errorRateThreshold }, + 'for': '5m', + labels: { severity: 'warning' }, + annotations: { + summary: 'Solr instance {{ $labels.instance }} has a high error rate', + description: ||| + Instance {{ $labels.instance }} error rate is {{ $value | humanizePercentage }} + (threshold: %(threshold)s%%). + Check Solr logs for the root cause. Common causes: invalid queries, missing fields, + core loading failures, or network connectivity issues. + ||| % { threshold: std.floor(cfg.errorRateThreshold * 100) }, + }, + }, + + // --------------------------------------------------------------- + // WARNING: SolrOverseerQueueBuildup + // Fires when the Overseer collection work queue exceeds 50 items for 5 minutes. + // This metric is only emitted in SolrCloud mode; alert does not fire in standalone. 
+ // --------------------------------------------------------------- + { + alert: 'SolrOverseerQueueBuildup', + expr: ||| + sum by (instance) (solr_overseer_collection_work_queue_size) + > %(threshold)s + ||| % { threshold: cfg.overseerQueueThreshold }, + 'for': '5m', + labels: { severity: 'warning' }, + annotations: { + summary: 'Solr Overseer collection work queue is building up on {{ $labels.instance }}', + description: ||| + The Overseer collection work queue has {{ $value }} pending operations on + instance {{ $labels.instance }} (threshold: %(threshold)s). + A growing queue indicates the Overseer is falling behind; check for long-running + collection operations, Overseer leader election issues, or ZooKeeper latency. + ||| % { threshold: cfg.overseerQueueThreshold }, + }, + }, + + // --------------------------------------------------------------- + // WARNING: SolrHighMmapRatio + // Fires when index size exceeds 85% of available MMap memory (RAM minus heap). + // No absent() guard needed: when jvm_system_memory_total_bytes is absent, the + // expression produces no data and the alert naturally does not fire. + // This alert requires jvm_system_memory_total_bytes (Solr 10.x physical RAM metric). + // --------------------------------------------------------------- + { + alert: 'SolrHighMmapRatio', + expr: ||| + sum(solr_core_index_size_megabytes) + / + ( + (max(jvm_system_memory_total_bytes) - sum(jvm_memory_limit_bytes{jvm_memory_type="heap"})) + / 1e6 + ) + * 100 > %(threshold)s + ||| % { threshold: cfg.mmapRatioThreshold }, Review Comment: `SolrHighMmapRatio` is described (in this file’s header) as firing per-instance, but this expression aggregates across *all* instances (no `by (instance)` on the numerator/denominator). That makes it a cluster-wide ratio and can misattribute risk (and also drops the `instance` label entirely). 
Consider rewriting it to compute the ratio per instance (and include the Solr selector), so the alert fires/resolves independently per node. ########## solr/monitoring/mixin/alerts/alerts.libsonnet: ########## @@ -0,0 +1,225 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// alerts.libsonnet — Prometheus alert rules for Solr 10.x. +// +// Seven rules in group "SolrAlerts": +// Critical (3): SolrHighHeapUsage, SolrJvmGcThrashing, SolrLowDiskSpace +// Warning (4): SolrHighSearchLatency, SolrHighErrorRate, SolrOverseerQueueBuildup, SolrHighMmapRatio +// +// Thresholds are defined in config.libsonnet and can be overridden. +// All expressions include "by (instance)" so alerts fire and resolve per-node. + +local config = import '../config.libsonnet'; +local cfg = config._config; + +{ + groups: [ + { + name: 'SolrAlerts', + rules: [ + + // --------------------------------------------------------------- + // CRITICAL: SolrHighHeapUsage + // Fires when a single JVM instance uses > 90% of its max heap for 2 minutes. + // Uses max by (instance) to deduplicate the dual OTel JVM scopes. 
+ // --------------------------------------------------------------- + { + alert: 'SolrHighHeapUsage', + expr: ||| + max by (instance) (jvm_memory_used_bytes{jvm_memory_type="heap"}) + / + max by (instance) (jvm_memory_limit_bytes{jvm_memory_type="heap"}) + > %(threshold)s + ||| % { threshold: cfg.heapUsageThreshold }, + 'for': '2m', + labels: { severity: 'critical' }, + annotations: { + summary: 'Solr instance {{ $labels.instance }} has high JVM heap usage', + description: ||| + Instance {{ $labels.instance }} is using {{ $value | humanizePercentage }} of its + maximum JVM heap (threshold: %(threshold)s%%). + High heap usage increases GC pressure and risks OutOfMemoryError. + Consider increasing -Xmx or reducing cache sizes in solrconfig.xml. + ||| % { threshold: std.floor(cfg.heapUsageThreshold * 100) }, + }, + }, + + // --------------------------------------------------------------- + // CRITICAL: SolrJvmGcThrashing + // Fires when the sum of GC wall-clock time across all collectors exceeds + // cfg.gcThrashThresholdSecsPerMin seconds per minute for 3 minutes. + // --------------------------------------------------------------- + { + alert: 'SolrJvmGcThrashing', + expr: ||| + sum by (instance) (rate(jvm_gc_duration_seconds_sum{instance=~".+"}[1m])) + > %(threshold)s + ||| % { threshold: cfg.gcThrashThresholdSecsPerMin }, Review Comment: This GC thrashing rule also queries generic `jvm_gc_duration_seconds_sum` without applying a Solr selector (e.g., `job="solr"`). That can trigger alerts for other JVM apps in the same Prometheus. Please scope the metric selector using `cfg.solrSelector` (and ideally use `cfg.instanceLabel` instead of hard-coding `instance`). 
```suggestion sum by (%(instanceLabel)s) (rate(jvm_gc_duration_seconds_sum{%(solrSelector)s}[1m])) > %(threshold)s ||| % { threshold: cfg.gcThrashThresholdSecsPerMin, solrSelector: cfg.solrSelector, instanceLabel: cfg.instanceLabel, }, ``` ########## solr/monitoring/mixin/alerts/alerts.libsonnet: ########## @@ -0,0 +1,225 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// alerts.libsonnet — Prometheus alert rules for Solr 10.x. +// +// Seven rules in group "SolrAlerts": +// Critical (3): SolrHighHeapUsage, SolrJvmGcThrashing, SolrLowDiskSpace +// Warning (4): SolrHighSearchLatency, SolrHighErrorRate, SolrOverseerQueueBuildup, SolrHighMmapRatio +// +// Thresholds are defined in config.libsonnet and can be overridden. +// All expressions include "by (instance)" so alerts fire and resolve per-node. + +local config = import '../config.libsonnet'; +local cfg = config._config; + +{ + groups: [ + { + name: 'SolrAlerts', + rules: [ + + // --------------------------------------------------------------- + // CRITICAL: SolrHighHeapUsage + // Fires when a single JVM instance uses > 90% of its max heap for 2 minutes. + // Uses max by (instance) to deduplicate the dual OTel JVM scopes. 
+ // --------------------------------------------------------------- + { + alert: 'SolrHighHeapUsage', + expr: ||| + max by (instance) (jvm_memory_used_bytes{jvm_memory_type="heap"}) + / + max by (instance) (jvm_memory_limit_bytes{jvm_memory_type="heap"}) + > %(threshold)s + ||| % { threshold: cfg.heapUsageThreshold }, Review Comment: Prometheus alert expression uses generic JVM metrics without any Solr-specific selector (e.g., `job="solr"`). As written, this rule can fire for *any* JVM workload scraped by Prometheus (not just Solr). Since `config.libsonnet` defines `solrSelector`, consider adding it to the metric selectors in this rule (and other JVM-metric rules) to scope alerts to Solr only. ########## solr/monitoring/dev/prometheus/prometheus.yml: ########## @@ -0,0 +1,48 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# prometheus.yml — Prometheus configuration for the Solr 10.x monitoring stack. +# +# Used by the docker-compose.yml stack (solr/monitoring/docker-compose.yml). Review Comment: This comment references `solr/monitoring/docker-compose.yml`, but the compose stack in this PR is located at `solr/monitoring/dev/docker-compose.yml`. 
   Updating the path here would prevent confusion for contributors trying to find the referenced file.
```suggestion
# Used by the docker-compose.yml stack (solr/monitoring/dev/docker-compose.yml).
```

##########
solr/solr-ref-guide/modules/deployment-guide/pages/metrics-reporting.adoc:
##########
@@ -146,6 +146,12 @@ See https://prometheus.io/docs/concepts/data_model/[Prometheus Data Model] docum
 This endpoint can be used to pull/scrape metrics to a Prometheus server or any Prometheus-compatible backend directly from Solr.
 
+[NOTE]
+====
+For a complete Prometheus + Grafana + Alertmanager setup including a pre-built dashboard and alert rules, see xref:monitoring-with-prometheus-and-grafana.adoc[].
+A docker-compose evaluation stack is also available in `solr/monitoring/` in the source release.
+====

Review Comment:
   This note says the docker-compose evaluation stack is available in `solr/monitoring/`, but the compose file added in this PR lives under `solr/monitoring/dev/docker-compose.yml`. Consider pointing to the `dev/` subdirectory explicitly to avoid confusion for readers looking for the compose file.

##########
solr/monitoring/mixin/dashboards/dashboards.libsonnet:
##########
@@ -0,0 +1,622 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// dashboards.libsonnet — Solr 10.x Grafana dashboard definition.
+//
+// Rows:
+//   Node Overview (open by default) — query/index request rates, latency, cores, disk
+//   JVM (open by default) — heap, GC, threads, CPU
+//   SolrCloud (collapsed) — Overseer queues, ZK ops, shard leaders
+//   Index Health (collapsed) — segments, index size, merge rates, MMap efficiency
+//   Cache Efficiency (collapsed) — filter/query/document cache hit rates and evictions
+
+local config = import '../config.libsonnet';
+local g = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';
+
+local d = g.dashboard;
+local p = g.panel;
+local q = g.query.prometheus;
+local v = g.dashboard.variable;
+local cfg = config._config;
+
+// -----------------------------------------------------------------------
+// Computed label selectors (uses configurable label names from config.libsonnet)
+// -----------------------------------------------------------------------
+local envSel = '%s=~"$environment"' % cfg.environmentLabel;
+local clusterSel = '%s=~"$cluster"' % cfg.clusterLabel;
+local instSel = 'instance=~"$instance"';
+local colSel = 'collection=~"$collection",shard=~"$shard",replica_type=~"$replica_type"';

Review Comment:
   `config.libsonnet` exposes `instanceLabel`, but the dashboard hard-codes `instance=...` for selectors and for label_values queries. This makes the mixin less configurable than intended. Consider deriving the instance selector and any `label_values(..., instance)` uses from `cfg.instanceLabel` (and remove any now-unused locals like `colSel` if they’re not needed).

##########
solr/monitoring/mixin/Dockerfile:
##########
@@ -0,0 +1,55 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ============================================================================
+# Build image for the Solr monitoring mixin.
+# Contains: jsonnet, jsonnetfmt, jb (jsonnet-bundler), gojsontoyaml, promtool.
+#
+# Not intended for production use — developer tooling only.
+# See make.sh for usage.
+# ============================================================================
+
+# --- Stage 1: compile Go-based tools -----------------------------------------
+FROM golang:1-bookworm AS go-tools
+
+ENV CGO_ENABLED=0
+
+RUN go install github.com/google/go-jsonnet/cmd/jsonnet@latest \
+    && go install github.com/google/go-jsonnet/cmd/jsonnetfmt@latest \
+    && go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest \
+    && go install github.com/brancz/gojsontoyaml@latest
+
+# --- Stage 2: extract promtool from the official Prometheus image -------------
+FROM prom/prometheus:latest AS prometheus

Review Comment:
   The mixin build image installs jsonnet/jb/gojsontoyaml using `@latest` and also pulls `prom/prometheus:latest` for `promtool`. This makes regenerating artifacts non-deterministic over time (the same commit can produce different outputs). Consider pinning tool versions (and the Prometheus image/tag) so artifact regeneration is repeatable.
```suggestion
RUN go install github.com/google/go-jsonnet/cmd/[email protected] \
    && go install github.com/google/go-jsonnet/cmd/[email protected] \
    && go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/[email protected] \
    && go install github.com/brancz/[email protected]

# --- Stage 2: extract promtool from the official Prometheus image -------------
FROM prom/prometheus:v2.52.0 AS prometheus
```

##########
solr/monitoring/otel-collector-solr.yml:
##########
@@ -0,0 +1,125 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# otel-collector-solr.yml
+#
+# Drop-in OpenTelemetry Collector configuration snippet for Solr 10.x.
+#
+# This file shows how to:
+#   1. Receive Solr metrics via OTLP (pushed from Solr's OTel module) or scrape
+#      Solr's /api/metrics Prometheus endpoint.
+#   2. Normalize the otel_scope_name label so both ingestion paths produce
+#      identical label sets in Prometheus.
+#   3. Promote OTel resource attributes (deployment.environment, service.namespace)
+#      to Prometheus metric labels so the "environment" and "cluster" Grafana
+#      dashboard variables work on the OTLP path.
+#   4. Export to a Prometheus remote-write endpoint or expose a scrape endpoint.
+#
+# Merge relevant sections into your existing collector config, or use this file
+# as a complete standalone config for testing.
+
+receivers:
+  # Option A: Solr pushes metrics via OTLP (recommended when the OTel module is
+  # enabled in Solr). Configure Solr with:
+  #   OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
+  otlp:
+    protocols:
+      grpc:
+        endpoint: "0.0.0.0:4317"
+      http:
+        endpoint: "0.0.0.0:4318"
+
+  # Option B: Collector scrapes Solr's native Prometheus endpoint directly.
+  # Use this instead of (or in addition to) the OTLP receiver.
+  prometheus/solr:
+    config:
+      scrape_configs:
+        - job_name: solr
+          metrics_path: /api/metrics
+          scrape_interval: 15s
+          static_configs:
+            - targets: ["solr:8983"]
+          # Add environment and cluster labels to scraped metrics:
+          relabel_configs:
+            - target_label: environment
+              replacement: prod       # change per environment
+            - target_label: cluster
+              replacement: cluster-1  # change per cluster
+
+processors:
+  # Normalize otel_scope_name for metrics arriving via OTLP.
+  # This ensures that both the OTLP path and the native Prometheus scrape path
+  # produce the same label set in Prometheus, allowing the dashboard to work
+  # regardless of which ingestion path you use.
+  # The dashboard must NOT filter on otel_scope_name — this transform merely
+  # normalizes it for consistency; it is not required for dashboard correctness.
+  transform/solr_scope:
+    metric_statements:
+      - context: datapoint
+        statements:
+          - set(attributes["otel_scope_name"], "org.apache.solr") where resource.attributes["service.name"] == "solr"
+
+  # Promote OTel resource attributes to metric labels.
+  # This maps deployment.environment -> environment and service.namespace -> cluster
+  # so that the Grafana dashboard's $environment and $cluster variables work
+  # on the OTLP ingestion path (matching what relabel_configs does on the scrape path).
+  resource/solr_labels:
+    attributes:
+      - action: upsert
+        key: deployment.environment
+        from_attribute: deployment.environment
+      - action: upsert
+        key: service.namespace
+        from_attribute: service.namespace
+

Review Comment:
   `resource/solr_labels` is defined but never referenced in any pipeline’s `processors` list below, so it won’t actually run. Either remove it to avoid confusion, or add it to the appropriate pipeline(s) if it is intended to be part of the recommended configuration.
```suggestion
```

##########
solr/monitoring/mixin/dashboards/dashboards.libsonnet:
##########
@@ -0,0 +1,622 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// dashboards.libsonnet — Solr 10.x Grafana dashboard definition.
+//
+// Rows:
+//   Node Overview (open by default) — query/index request rates, latency, cores, disk
+//   JVM (open by default) — heap, GC, threads, CPU
+//   SolrCloud (collapsed) — Overseer queues, ZK ops, shard leaders
+//   Index Health (collapsed) — segments, index size, merge rates, MMap efficiency
+//   Cache Efficiency (collapsed) — filter/query/document cache hit rates and evictions
+
+local config = import '../config.libsonnet';
+local g = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';
+
+local d = g.dashboard;
+local p = g.panel;
+local q = g.query.prometheus;
+local v = g.dashboard.variable;
+local cfg = config._config;
+
+// -----------------------------------------------------------------------
+// Computed label selectors (uses configurable label names from config.libsonnet)
+// -----------------------------------------------------------------------
+local envSel = '%s=~"$environment"' % cfg.environmentLabel;
+local clusterSel = '%s=~"$cluster"' % cfg.clusterLabel;
+local instSel = 'instance=~"$instance"';
+local colSel = 'collection=~"$collection",shard=~"$shard",replica_type=~"$replica_type"';
+
+// -----------------------------------------------------------------------
+// Template variables (T012)
+// Ordered: datasource → environment → cluster → instance →
+//          collection → shard → replica_type → interval
+// -----------------------------------------------------------------------
+local datasourceVar =
+  v.datasource.new('datasource', 'prometheus')
+  + v.datasource.generalOptions.withLabel('Data Source');
+
+local environmentVar =
+  v.query.new(
+    'environment',
+    'label_values(solr_cores_loaded, %s)' % cfg.environmentLabel
+  )
+  + v.query.withDatasourceFromVariable(datasourceVar)
+  + v.query.selectionOptions.withMulti()
+  + v.query.selectionOptions.withIncludeAll(value=true, customAllValue='.*')
+  + v.query.refresh.onTime()
+  + v.query.generalOptions.withLabel('Environment');
+
+local clusterVar =
+  v.query.new(
+    'cluster',
+    'label_values(solr_cores_loaded{%s}, %s)' % [envSel, cfg.clusterLabel]
+  )
+  + v.query.withDatasourceFromVariable(datasourceVar)
+  + v.query.selectionOptions.withMulti()
+  + v.query.selectionOptions.withIncludeAll(value=true, customAllValue='.*')
+  + v.query.refresh.onTime()
+  + v.query.generalOptions.withLabel('Cluster');
+
+local instanceVar =
+  v.query.new(
+    'instance',
+    'label_values(solr_cores_loaded{%s,%s}, instance)' % [envSel, clusterSel]
+  )
+  + v.query.withDatasourceFromVariable(datasourceVar)
+  + v.query.selectionOptions.withMulti()
+  + v.query.selectionOptions.withIncludeAll(value=true, customAllValue='.*')
+  + v.query.refresh.onTime()
+  + v.query.generalOptions.withLabel('Instance');
+
+local collectionVar =
+  v.query.new(
+    'collection',
+    'label_values(solr_core_requests_total{%s}, collection)' % instSel
+  )
+  + v.query.withDatasourceFromVariable(datasourceVar)
+  + v.query.selectionOptions.withMulti()
+  + v.query.selectionOptions.withIncludeAll(value=true, customAllValue='.*')
+  + v.query.refresh.onTime()
+  + v.query.generalOptions.withLabel('Collection');
+
+local shardVar =
+  v.query.new(
+    'shard',
+    'label_values(solr_core_requests_total{%s,collection=~"$collection"}, shard)' % instSel
+  )
+  + v.query.withDatasourceFromVariable(datasourceVar)
+  + v.query.selectionOptions.withMulti()
+  + v.query.selectionOptions.withIncludeAll(value=true, customAllValue='.*')
+  + v.query.refresh.onTime()
+  + v.query.generalOptions.withLabel('Shard');
+
+local replicaTypeVar =
+  v.query.new(
+    'replica_type',
+    'label_values(solr_core_requests_total{%s,collection=~"$collection"}, replica_type)' % instSel
+  )
+  + v.query.withDatasourceFromVariable(datasourceVar)
+  + v.query.selectionOptions.withMulti()
+  + v.query.selectionOptions.withIncludeAll(value=true, customAllValue='.*')
+  + v.query.refresh.onTime()
+  + v.query.generalOptions.withLabel('Replica Type');
+
+local intervalVar =
+  v.interval.new('interval', ['1m', '5m', '10m', '30m', '1h'])
+  + v.interval.generalOptions.withCurrent('1m')

Review Comment:
   `config.libsonnet` defines `defaultRateInterval`, but the dashboard interval variable defaults to `1m` and the query helper always uses `$interval`. This mismatch makes the config setting ineffective. Consider wiring the dashboard’s interval default (and/or the variable option list) to `cfg.defaultRateInterval` so changing it in config actually changes the generated dashboard behavior.
```suggestion
  + v.interval.generalOptions.withCurrent(cfg.defaultRateInterval)
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
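The `SolrHighHeapUsage` review comment above asks for Solr-scoped metric selectors but does not include a suggestion. A minimal sketch of what that change could look like, assuming `cfg.solrSelector` expands to a plain label matcher (e.g. `job="solr"`), the same interpolation pattern the GC-thrashing suggestion already uses:

```jsonnet
// Sketch only — not the PR's actual code. cfg.solrSelector is assumed to be
// a label-matcher string (e.g. 'job="solr"') defined in config.libsonnet,
// interpolated into both heap metrics so the rule only matches Solr JVMs.
expr: |||
  max by (instance) (jvm_memory_used_bytes{jvm_memory_type="heap",%(solrSelector)s})
  /
  max by (instance) (jvm_memory_limit_bytes{jvm_memory_type="heap",%(solrSelector)s})
  > %(threshold)s
||| % { threshold: cfg.heapUsageThreshold, solrSelector: cfg.solrSelector },
```

The same substitution would apply to the other JVM-metric rules the comment mentions.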
