imbajin commented on code in PR #2952:
URL: 
https://github.com/apache/incubator-hugegraph/pull/2952#discussion_r2845287905


##########
hugegraph-server/hugegraph-dist/src/assembly/static/bin/wait-storage.sh:
##########
@@ -70,7 +70,28 @@ done < <(env | sort -r | awk -F= '{ st = index($0, "="); print $1 " " substr($0,
 # wait for storage
 if env | grep '^hugegraph\.' > /dev/null; then
     if [ -n "${WAIT_STORAGE_TIMEOUT_S:-}" ]; then
-        timeout "${WAIT_STORAGE_TIMEOUT_S}s" bash -c \
-        "until bin/gremlin-console.sh -- -e $DETECT_STORAGE > /dev/null 2>&1; do echo \"Hugegraph server are waiting for storage backend...\"; sleep 5; done"
+        # Extract pd.peers from config or environment
+        PD_PEERS="${hugegraph_pd_peers:-}"
+        if [ -z "$PD_PEERS" ]; then
+            PD_PEERS=$(grep -E "^\s*pd\.peers\s*=" "$GRAPH_CONF" | sed 's/.*=\s*//' | tr -d ' ')
+        fi
+
+        if [ -n "$PD_PEERS" ]; then
+            # Convert gRPC address to REST address (8686 -> 8620)
+            PD_REST=$(echo "$PD_PEERS" | sed 's/:8686/:8620/g' | cut -d',' -f1)
+            echo "Waiting for PD REST endpoint at $PD_REST..."
+
+            timeout "${WAIT_STORAGE_TIMEOUT_S}s" bash -c "
+                until curl -fsS http://${PD_REST}/v1/health >/dev/null 2>&1; do
+                    echo 'Hugegraph server are waiting for storage backend...'
+                    sleep 5
+                done
+                echo 'PD is reachable, waiting extra 10s for store registration...'
+                sleep 10
+                echo 'Storage backend is ready!'
+            " || echo "Warning: Timeout waiting for storage, proceeding anyway..."
+        else
+            echo "No pd.peers configured, skipping storage wait..."
+        fi

Review Comment:
   > Partition-based readiness checks may be unreliable since partition 
assignment occurs asynchronously after wait-storage completes, and a properly 
registered Store can legitimately report partitionCount = 0 during normal 
initialization. Interpreting this as a failure condition could unintentionally 
block startup in otherwise healthy clusters. Would it make sense to consider 
separating validation into pre-startup checks (Store/PD availability) and 
post-startup checks (partition stabilization) instead?
   
   Your point is right: `partitionCount` can legitimately be `0` during early 
initialization, so using it as a strict pre-start gate can block healthy 
clusters.
   
   The core concern I want to preserve is:
   1. Current readiness is too weak (`PD /v1/health + fixed sleep` only).  
   2. Timeout currently does `warning + continue`, which allows starting in a 
broken state (no fail-fast).
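   
   For (2), the fail-fast shape could look roughly like the sketch below. This is illustrative only, not the actual `wait-storage.sh` code: `wait_for` and `TIMEOUT_S` are made-up names, and the trivial `true` probe stands in for the real health check.
   
   ```shell
   #!/bin/sh
   # Hedged sketch: turn "warn and continue" into fail-fast on timeout.
   # wait_for and TIMEOUT_S are illustrative names, not the script's real ones.
   wait_for() {
       # $1: a probe command; retry until it succeeds or the timeout expires
       timeout "${TIMEOUT_S:-60}" sh -c "until $1 >/dev/null 2>&1; do sleep 1; done"
   }
   
   # In the real script the probe would be something like
   #   'curl -fsS http://${PD_REST}/v1/health'
   if wait_for "true"; then
       echo "backend ready"
   else
       echo "ERROR: storage backend not reachable within timeout, aborting" >&2
       exit 1   # exit instead of starting the server in a broken state
   fi
   ```
   
   The key difference from the current code is the `exit 1` branch: a timeout becomes a startup failure the orchestrator can see, rather than a warning followed by a start in a broken state.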
   
   Maybe we could split validation into pre-start hard gates vs post-start 
stabilization checks:
   
   ```text
   docker compose up
      |
      v
   [Pre-start Gate]  (must pass, otherwise exit)
     - PD /v1/health == OK
     - Store /v1/health == OK
     - PD /v1/stores has at least 1 Up store
      |
      v
   [Init-store with retries] (must succeed, otherwise exit)
      |
      v
   [Start server]
      |
      v
   [Post-start checks] (non-blocking)
     - observe partition/leader stabilization
     - warn if not converged yet, but do not treat as pre-start failure
     ```
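   
   In script terms, the pre-start hard gate might be sketched like this. It is a sketch under assumptions: `PD_REST`/`STORE_REST` are hypothetical variable names, and the `"state":"Up"` field in the `/v1/stores` response is a guess about PD's JSON shape, so the network calls are shown as comments and only the store-count parsing is exercised against a canned body.
   
   ```shell
   #!/bin/sh
   # Hedged sketch of the pre-start hard gate; any failed check aborts startup.
   # count_up_stores parses a /v1/stores JSON body; the "state":"Up" field name
   # is an assumption about PD's response shape, not confirmed from its API.
   count_up_stores() {
       grep -o '"state"[[:space:]]*:[[:space:]]*"Up"' | wc -l | tr -d ' '
   }
   
   hard_gate() {
       # Network calls shown for shape only; they need a live PD/Store to pass:
       #   curl -fsS "http://${PD_REST}/v1/health"    >/dev/null || exit 1
       #   curl -fsS "http://${STORE_REST}/v1/health" >/dev/null || exit 1
       #   up=$(curl -fsS "http://${PD_REST}/v1/stores" | count_up_stores)
       #   [ "$up" -ge 1 ] || { echo "no Up store registered in PD" >&2; exit 1; }
       :
   }
   
   # Demonstration of the store-count check against a canned response:
   echo '{"stores":[{"state":"Up"},{"state":"Offline"}]}' | count_up_stores   # prints 1
   ```
   
   The post-start partition/leader checks would then live outside this gate and only log, never exit, so a Store that legitimately reports `partitionCount = 0` during initialization cannot block startup.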



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

