Serge Huber created UNOMI-946:
---------------------------------
Summary: Add search engine cluster health checks and replica
enforcement to IT harness before and after migration
Key: UNOMI-946
URL: https://issues.apache.org/jira/browse/UNOMI-946
Project: Apache Unomi
Issue Type: Sub-task
Components: unomi(-core)
Affects Versions: unomi-3.1.0
Reporter: Serge Huber
Fix For: unomi-3.1.0
The IT harness can enter a yellow/red cluster state during the test suite due
to four root causes:
# *Replica settings cannot be set via container environment variables.*
{{index.number_of_replicas}} is an index-level setting in ES 9.x and OS 3.x; it
is explicitly rejected in node config / Docker env vars. The current
{{pom.xml}} correctly uses env vars for routing settings (e.g.
{{cluster.routing.allocation.disk.threshold_enabled}}) but has no mechanism
to enforce 0 replicas on indices.
# *ES 9.x creates system data streams with {{number_of_replicas=1}} during
test execution.* The cluster starts GREEN (Unomi's own indices have
{{numberOfReplicas=0}}), but as the test suite progresses, ES internally
creates background data streams (deprecation log streams, ILM bookkeeping,
etc.) with the default replica count. On a single-node container these cannot
be allocated, transitioning the cluster to YELLOW.
# *Snapshot restore brings back 1.6.x indices with their original replica
settings.* {{Migrate16xToCurrentVersionIT}} restores a snapshot taken against a
cluster with the ES default {{number_of_replicas=1}}. All restored indices
come back with 1 replica, immediately putting the cluster in a YELLOW state
that the migration then runs against.
# *{{minimalClusterState}} asymmetry in BaseIT.* The OpenSearch path in BaseIT
already overrides {{minimalClusterState=YELLOW}} (line 619), but the
Elasticsearch path has no equivalent override, so
{{ElasticSearchHealthCheckProvider}} returns {{UP}} instead of {{LIVE}} and
{{HealthCheckIT}} fails.
*Observed symptom:* {{HealthCheckIT.testHealthCheck}} and
{{testConcurrentHealthCheck}} fail consistently when run with the Elasticsearch
profile because the health endpoint returns HTTP 206 with
{{elasticsearch.status=UP}} instead of {{LIVE}}.
----
*Proposed changes:*
Three check/fix anchor points, all using direct HTTP calls to
{{localhost:\{getSearchPort()}}} (container is up, security is disabled, no
auth required):
*Point 0 — {{BaseIT.checkSearchEngine()}} (runs first in both migration and
non-migration paths)*
* {{PUT /_cluster/settings}} → {{{"persistent": \{"index.number_of_replicas":
"0"}}}} — enforces 0 replicas as the cluster-wide default for all subsequently
created indices (Unomi indices, ES system indices, migration reindex targets)
* {{GET /_cluster/health}} + log baseline cluster status
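A minimal sketch of the two Point 0 requests, using only the JDK's {{java.net.http}} API. The class name {{ReplicaDefaultsRequests}}, its method names and the {{searchPort}} parameter are hypothetical; the request bodies are copied from the bullets above, and whether ES 9.x / OS 3.x accept this index-level key in cluster settings is not verified here.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of BaseIT: it only builds the Point 0 requests.
final class ReplicaDefaultsRequests {
    private ReplicaDefaultsRequests() {}

    // Body taken verbatim from the ticket; acceptance by the engine is an assumption.
    static final String ZERO_REPLICAS_BODY =
            "{\"persistent\": {\"index.number_of_replicas\": \"0\"}}";

    // PUT /_cluster/settings against the local container (no auth, security disabled).
    static HttpRequest putClusterSettings(int searchPort) {
        return HttpRequest.newBuilder(
                        URI.create("http://localhost:" + searchPort + "/_cluster/settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(ZERO_REPLICAS_BODY))
                .build();
    }

    // GET /_cluster/health, used to log the baseline cluster status.
    static HttpRequest getClusterHealth(int searchPort) {
        return HttpRequest.newBuilder(
                        URI.create("http://localhost:" + searchPort + "/_cluster/health"))
                .GET()
                .build();
    }
}
```

Either request could then be sent with {{HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())}}, logging the response status code and body so that a rejected setting is visible in the IT output.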
*Point 1 — {{Migrate16xToCurrentVersionIT.waitForStartup()}}, immediately
after snapshot restore*
* {{PUT /_all/_settings}} → {{{"index": \{"number_of_replicas": "0"}}}} —
retroactively zeros out replicas on all restored 1.6.x indices
* {{GET /_cluster/health?wait_for_status=green&timeout=30s}} — wait for GREEN
before starting the migration; fail fast with diagnosis if not reached
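The fail-fast check on the health response could look roughly like the sketch below. It assumes the body carries the usual top-level {{status}} and {{timed_out}} fields of {{/_cluster/health}}; the class {{GreenGate}} and its methods are hypothetical, and the regex extraction is a stand-in for whatever JSON library the harness already uses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: evaluates the body returned by the wait-for-green call.
final class GreenGate {
    private GreenGate() {}

    // Path from the ticket; appended to http://localhost:<search port>.
    static final String WAIT_FOR_GREEN_PATH =
            "/_cluster/health?wait_for_status=green&timeout=30s";

    private static final Pattern STATUS =
            Pattern.compile("\"status\"\\s*:\\s*\"(\\w+)\"");
    private static final Pattern TIMED_OUT =
            Pattern.compile("\"timed_out\"\\s*:\\s*(true|false)");

    // Returns the reported status, or "unknown" if the field is absent.
    static String status(String healthJson) {
        Matcher m = STATUS.matcher(healthJson);
        return m.find() ? m.group(1) : "unknown";
    }

    // True only when the body explicitly reports timed_out=true.
    static boolean timedOut(String healthJson) {
        Matcher m = TIMED_OUT.matcher(healthJson);
        return m.find() && Boolean.parseBoolean(m.group(1));
    }

    // Throws with the raw health body so the failure is diagnosable from the log.
    static void requireGreen(String healthJson, String context) {
        String status = status(healthJson);
        boolean timedOut = timedOut(healthJson);
        if (timedOut || !"green".equalsIgnoreCase(status)) {
            throw new IllegalStateException(context + ": cluster status is " + status
                    + " (timed_out=" + timedOut + "); raw health: " + healthJson);
        }
    }
}
```

Calling {{GreenGate.requireGreen(body, "after snapshot restore")}} right after the replica fix would stop the migration from starting against a YELLOW cluster.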
*Point 2 — After migration + {{super.waitForStartup()}} (migration path) and
after {{unomi:start}} (non-migration path)*
* {{GET /_cat/indices?h=index,rep,health&format=json}} — assert no index has
{{rep > 0}}; log any violations
* {{GET /_cluster/health}} — assert GREEN; surface which indices have
unassigned shards if not
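As an illustration of the replica assertion, the sketch below scans a {{_cat/indices?h=index,rep,health&format=json}} response for rows with {{rep > 0}}. It assumes a flat JSON array whose values are strings, which is how the cat APIs usually render them; {{ReplicaViolations}} is a hypothetical name and the regex parsing is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists indices whose replica count is above zero.
final class ReplicaViolations {
    private ReplicaViolations() {}

    // Each row of the cat output is a flat JSON object without nested braces.
    private static final Pattern ROW = Pattern.compile("\\{[^{}]*\\}");
    private static final Pattern INDEX =
            Pattern.compile("\"index\"\\s*:\\s*\"([^\"]+)\"");
    private static final Pattern REP =
            Pattern.compile("\"rep\"\\s*:\\s*\"(\\d+)\"");

    static List<String> indicesWithReplicas(String catIndicesJson) {
        List<String> violations = new ArrayList<>();
        Matcher rows = ROW.matcher(catIndicesJson);
        while (rows.find()) {
            String row = rows.group();
            Matcher index = INDEX.matcher(row);
            Matcher rep = REP.matcher(row);
            // Rows missing either field are skipped rather than guessed at.
            if (index.find() && rep.find() && Integer.parseInt(rep.group(1)) > 0) {
                violations.add(index.group(1) + " (rep=" + rep.group(1) + ")");
            }
        }
        return violations;
    }
}
```

An {{assertClusterHealthy(String context)}} implementation could fail with the returned list in the message when it is non-empty, which gives the index names asked for in the acceptance criteria.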
Additionally, add the missing {{minimalClusterState=YELLOW}} override for
Elasticsearch to BaseIT (mirroring the OpenSearch line) as a
belt-and-suspenders fallback.
*Code structure:*
* {{BaseIT.configureSearchEngineForTesting()}} — reusable method for Point 0,
called from {{checkSearchEngine()}}
* {{BaseIT.assertClusterHealthy(String context)}} — reusable method for Point 2
* {{Migrate16xToCurrentVersionIT.fixRestoredIndexReplicas()}} —
migration-specific Point 1
----
*Acceptance criteria:*
* {{HealthCheckIT.testHealthCheck}} and {{testConcurrentHealthCheck}} pass
consistently with the Elasticsearch profile on a single-node container
* {{Migrate16xToCurrentVersionIT}} restores the 1.6.x snapshot and runs the
migration against a GREEN cluster
* Any index with {{number_of_replicas > 0}} after startup or migration causes
a clear, early diagnostic failure with the index name(s) listed — not a silent
YELLOW cluster
* Both Elasticsearch and OpenSearch paths are covered
* {{configureSearchEngineForTesting()}} and {{assertClusterHealthy()}} are
reusable from any future migration IT