Serge Huber created UNOMI-946:
---------------------------------

             Summary: Add search engine cluster health checks and replica 
enforcement to IT harness before and after migration
                 Key: UNOMI-946
                 URL: https://issues.apache.org/jira/browse/UNOMI-946
             Project: Apache Unomi
          Issue Type: Sub-task
          Components: unomi(-core)
    Affects Versions: unomi-3.1.0
            Reporter: Serge Huber
             Fix For: unomi-3.1.0


The IT harness can enter a YELLOW/RED cluster state during the test suite due 
to four root causes:
 # *Replica settings cannot be set via container environment variables.* 
{{index.number_of_replicas}} is an index-level setting in ES 9.x and OS 3.x; it 
is explicitly rejected in node config / Docker env vars. The current 
{{pom.xml}} correctly uses env vars for routing settings (e.g. 
{{cluster.routing.allocation.disk.threshold_enabled}}) but has no mechanism 
to enforce 0 replicas on indices.

 # *ES 9.x creates system data streams with {{number_of_replicas=1}} during 
test execution.* The cluster starts GREEN (Unomi's own indices have 
{{numberOfReplicas=0}}), but as the test suite progresses, ES internally 
creates background data streams (deprecation log streams, ILM bookkeeping, 
etc.) with the default replica count. On a single-node container these cannot 
be allocated, transitioning the cluster to YELLOW.

 # *Snapshot restore brings back 1.6.x indices with their original replica 
settings.* {{Migrate16xToCurrentVersionIT}} restores a snapshot taken against a 
cluster with the ES default {{number_of_replicas=1}}. All restored indices 
come back with 1 replica, immediately putting the cluster in a YELLOW state 
that the migration then runs against.

 # *{{minimalClusterState}} asymmetry in BaseIT.* The OpenSearch path in BaseIT 
already overrides {{minimalClusterState=YELLOW}} (line 619), but the 
Elasticsearch path has no equivalent override, so 
{{ElasticSearchHealthCheckProvider}} returns {{UP}} instead of {{LIVE}} and 
{{HealthCheckIT}} fails.

*Observed symptom:* {{HealthCheckIT.testHealthCheck}} and 
{{testConcurrentHealthCheck}} fail consistently when run with the Elasticsearch 
profile because the health endpoint returns HTTP 206 with 
{{elasticsearch.status=UP}} instead of {{LIVE}}.
----
*Proposed changes:*

Three check/fix anchor points, all using direct HTTP calls to 
{{localhost:\{getSearchPort()}}} (container is up, security is disabled, no 
auth required):

*Point 0 — {{BaseIT.checkSearchEngine()}} (runs first in both migration and 
non-migration paths)*
 * {{PUT /_cluster/settings}} → {{{"persistent": \{"index.number_of_replicas": 
"0"}}}} — enforces 0 replicas as the cluster-wide default for all subsequently 
created indices (Unomi indices, ES system indices, migration reindex targets)
 * {{GET /_cluster/health}} + log baseline cluster status
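
A minimal sketch of how Point 0 could be wired, assuming Java 11+ {{java.net.http}} is available in the IT classpath. The class name {{SearchEngineTestConfig}}, the helper names, and the {{searchPort}} parameter (standing in for {{getSearchPort()}}) are hypothetical. The request body is copied from the description above; the sketch does not assume the engine accepts it and instead reports the HTTP status in the baseline log line.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SearchEngineTestConfig {

    // Body taken verbatim from the Point 0 description in this ticket.
    static final String ZERO_REPLICAS_BODY =
            "{\"persistent\": {\"index.number_of_replicas\": \"0\"}}";

    // Builds the PUT /_cluster/settings request (no auth: security is disabled).
    static HttpRequest buildClusterSettingsRequest(int searchPort) {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:" + searchPort + "/_cluster/settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(ZERO_REPLICAS_BODY))
                .build();
    }

    // Extracts the "status" value from a GET /_cluster/health response body.
    // Regex-based on purpose so the sketch needs no JSON library.
    static String extractStatus(String healthJson) {
        Matcher m = Pattern.compile("\"status\"\\s*:\\s*\"(\\w+)\"").matcher(healthJson);
        return m.find() ? m.group(1).toUpperCase(Locale.ROOT) : "UNKNOWN";
    }

    // Formats the baseline log line; a non-2xx settings response is reported, not swallowed.
    static String describeBaseline(int settingsHttpStatus, String healthJson) {
        String settingsResult = (settingsHttpStatus >= 200 && settingsHttpStatus < 300)
                ? "applied"
                : "rejected (HTTP " + settingsHttpStatus + ")";
        return "replica default " + settingsResult
                + ", baseline cluster status " + extractStatus(healthJson);
    }
}
```

Keeping request construction and response interpretation as pure static helpers lets them be unit-tested without a running container; the actual send would go through {{HttpClient.send(...)}} inside {{configureSearchEngineForTesting()}}.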

*Point 1 — {{Migrate16xToCurrentVersionIT.waitForStartup()}}, immediately 
after snapshot restore*
 * {{PUT /_all/_settings}} → {{{"index": \{"number_of_replicas": "0"}}}} — 
retroactively zeros out replicas on all restored 1.6.x indices
 * {{GET /_cluster/health?wait_for_status=green&timeout=30s}} — wait for GREEN 
before starting the migration; fail fast with diagnosis if not reached
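
The two calls in Point 1 could look roughly like the following sketch, again assuming Java 11+ {{java.net.http}}. The class name {{RestoredIndexReplicaFix}}, the {{baseUrl}} parameter, and the helper names are hypothetical. The health check treats a non-200 response, {{"timed_out": true}}, or a non-green status as failure, since the way a timeout is reported may differ between engine versions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class RestoredIndexReplicaFix {

    // PUT /_all/_settings with the body from the ticket: zero replicas on every index.
    static HttpRequest zeroReplicasRequest(String baseUrl) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/_all/_settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(
                        "{\"index\": {\"number_of_replicas\": \"0\"}}"))
                .build();
    }

    // GET /_cluster/health that blocks server-side until GREEN or the 30s timeout.
    static HttpRequest waitForGreenRequest(String baseUrl) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/_cluster/health?wait_for_status=green&timeout=30s"))
                .GET()
                .build();
    }

    // GREEN counts as reached only if the call succeeded, did not time out,
    // and the reported status is green.
    static boolean reachedGreen(int httpStatus, String body) {
        boolean timedOut = body.matches("(?s).*\"timed_out\"\\s*:\\s*true.*");
        boolean green = body.matches("(?s).*\"status\"\\s*:\\s*\"green\".*");
        return httpStatus == 200 && !timedOut && green;
    }

    // Runs both steps and fails fast, including the raw health body as diagnosis.
    static void fixRestoredIndexReplicas(HttpClient client, String baseUrl) throws Exception {
        HttpResponse<String> put = client.send(
                zeroReplicasRequest(baseUrl), HttpResponse.BodyHandlers.ofString());
        if (put.statusCode() != 200) {
            throw new IllegalStateException(
                    "Could not zero replicas: HTTP " + put.statusCode() + " " + put.body());
        }
        HttpResponse<String> health = client.send(
                waitForGreenRequest(baseUrl), HttpResponse.BodyHandlers.ofString());
        if (!reachedGreen(health.statusCode(), health.body())) {
            throw new IllegalStateException("Cluster not GREEN before migration: HTTP "
                    + health.statusCode() + " " + health.body());
        }
    }
}
```

A caller in {{waitForStartup()}} would pass something like {{"http://localhost:" + getSearchPort()}} as {{baseUrl}} right after the snapshot restore completes.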

*Point 2 — After migration + {{super.waitForStartup()}} (migration path) and 
after {{unomi:start}} (non-migration path)*
 * {{GET /_cat/indices?h=index,rep,health&format=json}} — assert no index has 
{{rep > 0}}; log any violations
 * {{GET /_cluster/health}} — assert GREEN; surface which indices have 
unassigned shards if not
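
The replica assertion in Point 2 can be sketched as a pure function over the {{_cat/indices}} JSON body. This is an illustration only: the class name {{ReplicaAudit}} and its methods are hypothetical, and the regex-based parsing assumes the flat array-of-objects shape with string values that {{format=json}} produces for the {{index}}, {{rep}} and {{health}} columns. A real implementation would more likely reuse whatever JSON library the ITs already depend on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReplicaAudit {

    // Matches one flat JSON object; the cat API rows contain no nested objects.
    private static final Pattern OBJECT = Pattern.compile("\\{[^{}]*\\}");

    // Returns the string value of a field inside one flat JSON object, or null.
    private static String field(String obj, String name) {
        Matcher m = Pattern
                .compile("\"" + Pattern.quote(name) + "\"\\s*:\\s*\"([^\"]*)\"")
                .matcher(obj);
        return m.find() ? m.group(1) : null;
    }

    // Lists every index whose "rep" column is a number greater than zero.
    static List<String> findReplicaViolations(String catIndicesJson) {
        List<String> violations = new ArrayList<>();
        Matcher m = OBJECT.matcher(catIndicesJson);
        while (m.find()) {
            String obj = m.group();
            String rep = field(obj, "rep");
            // Rows without a numeric rep value are skipped rather than guessed at.
            if (rep != null && rep.matches("\\d+") && Integer.parseInt(rep) > 0) {
                violations.add(field(obj, "index")
                        + " (rep=" + rep + ", health=" + field(obj, "health") + ")");
            }
        }
        return violations;
    }

    // Fails with the offending index names so the cause is visible in the test report.
    static void assertNoReplicas(String context, String catIndicesJson) {
        List<String> violations = findReplicaViolations(catIndicesJson);
        if (!violations.isEmpty()) {
            throw new AssertionError(
                    "[" + context + "] indices with replicas > 0: " + violations);
        }
    }
}
```

{{assertClusterHealthy(String context)}} could call {{assertNoReplicas(context, body)}} first and then check {{/_cluster/health}}, so that a replica violation is reported by index name before the more generic "cluster is not GREEN" failure.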

Additionally, add the missing {{minimalClusterState=YELLOW}} override for 
Elasticsearch to BaseIT (mirroring the OpenSearch line) as a 
belt-and-suspenders fallback.

*Code structure:*
 * {{BaseIT.configureSearchEngineForTesting()}} — reusable method for Point 0, 
called from {{checkSearchEngine()}}
 * {{BaseIT.assertClusterHealthy(String context)}} — reusable method for Point 2
 * {{Migrate16xToCurrentVersionIT.fixRestoredIndexReplicas()}} — 
migration-specific Point 1

----
*Acceptance criteria:*
 *  {{HealthCheckIT.testHealthCheck}} and {{testConcurrentHealthCheck}} pass 
consistently with the Elasticsearch profile on a single-node container
 *  {{Migrate16xToCurrentVersionIT}} restores the 1.6.x snapshot and runs the 
migration against a GREEN cluster
 *  Any index with {{number_of_replicas > 0}} after startup or migration causes 
a clear, early diagnostic failure with the index name(s) listed — not a silent 
YELLOW cluster
 *  Both Elasticsearch and OpenSearch paths are covered
 *  {{configureSearchEngineForTesting()}} and {{assertClusterHealthy()}} are 
reusable from any future migration IT



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
