imbajin commented on code in PR #2952:
URL:
https://github.com/apache/incubator-hugegraph/pull/2952#discussion_r2845287905
##########
hugegraph-server/hugegraph-dist/src/assembly/static/bin/wait-storage.sh:
##########
@@ -70,7 +70,28 @@ done < <(env | sort -r | awk -F= '{ st = index($0, "="); print $1 " " substr($0,
# wait for storage
if env | grep '^hugegraph\.' > /dev/null; then
if [ -n "${WAIT_STORAGE_TIMEOUT_S:-}" ]; then
- timeout "${WAIT_STORAGE_TIMEOUT_S}s" bash -c \
- "until bin/gremlin-console.sh -- -e $DETECT_STORAGE > /dev/null 2>&1; do echo \"Hugegraph server are waiting for storage backend...\"; sleep 5; done"
+ # Extract pd.peers from config or environment
+ PD_PEERS="${hugegraph_pd_peers:-}"
+ if [ -z "$PD_PEERS" ]; then
+ PD_PEERS=$(grep -E "^\s*pd\.peers\s*=" "$GRAPH_CONF" | sed 's/.*=\s*//' | tr -d ' ')
+ fi
+
+ if [ -n "$PD_PEERS" ]; then
+ # Convert gRPC address to REST address (8686 -> 8620)
+ PD_REST=$(echo "$PD_PEERS" | sed 's/:8686/:8620/g' | cut -d',' -f1)
+ echo "Waiting for PD REST endpoint at $PD_REST..."
+
+ timeout "${WAIT_STORAGE_TIMEOUT_S}s" bash -c "
+ until curl -fsS http://${PD_REST}/v1/health >/dev/null 2>&1; do
+ echo 'Hugegraph server are waiting for storage backend...'
+ sleep 5
+ done
+ echo 'PD is reachable, waiting extra 10s for store registration...'
+ sleep 10
+ echo 'Storage backend is ready!'
+ " || echo "Warning: Timeout waiting for storage, proceeding anyway..."
+ else
+ echo "No pd.peers configured, skipping storage wait..."
+ fi
Review Comment:
> Partition-based readiness checks may be unreliable since partition
> assignment occurs asynchronously after wait-storage completes, and a properly
> registered Store can legitimately report partitionCount = 0 during normal
> initialization. Interpreting this as a failure condition could unintentionally
> block startup in otherwise healthy clusters. Would it make sense to consider
> separating validation into pre-startup checks (Store/PD availability) and
> post-startup checks (partition stabilization) instead?

Your point is right: `partitionCount` can legitimately be `0` during early
initialization, so using it as a strict pre-start gate can block healthy
clusters.
The core concerns I want to preserve are:
1. Current readiness is too weak (`PD /v1/health + fixed sleep` only).
2. On timeout, the script currently does `warning + continue`, which allows
the server to start in a broken state (no fail-fast).
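
To illustrate the second point, here is a minimal sketch of a fail-fast wrapper. It assumes GNU coreutils `timeout`, which exits with status 124 when the deadline is hit; `run_or_fail` is a hypothetical helper name, not something that exists in the PR:

```bash
# Hypothetical helper: run a command under a deadline and propagate the
# failure to the caller instead of printing a warning and continuing.
# Assumes GNU coreutils `timeout`, which exits with 124 on timeout.
run_or_fail() {
    local timeout_s=$1
    shift
    local rc=0
    timeout "${timeout_s}s" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "ERROR: timed out after ${timeout_s}s: $*" >&2
    elif [ "$rc" -ne 0 ]; then
        echo "ERROR: command failed with status ${rc}: $*" >&2
    fi
    return "$rc"
}
```

The caller would then use something like `run_or_fail "$WAIT_STORAGE_TIMEOUT_S" bash -c '...' || exit 1` rather than `|| echo "Warning: ..."`, so a timeout stops the container instead of letting it start against an unready backend.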
Maybe we could split validation into pre-start hard gates vs post-start
stabilization checks:
```text
docker compose up
        |
        v
[Pre-start Gate] (must pass, otherwise exit)
  - PD /v1/health == OK
  - Store /v1/health == OK
  - PD /v1/stores has at least 1 Up store
        |
        v
[Init-store with retries] (must succeed, otherwise exit)
        |
        v
[Start server]
        |
        v
[Post-start checks] (non-blocking)
  - observe partition/leader stabilization
  - warn if not converged yet, but do not treat as pre-start failure
```
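
A rough sketch of what the pre-start gate in the flow above could look like in `wait-storage.sh`. Several things here are assumptions rather than facts about the codebase: `STORE_REST` is a hypothetical variable holding the Store REST address, the helper names are made up, and matching the string `"Up"` in the `/v1/stores` response is only a placeholder since I have not checked the actual response format:

```bash
# Retry a check until it succeeds or the deadline passes.
# Usage: wait_until <timeout_s> <interval_s> <description> <command...>
wait_until() {
    local timeout_s=$1 interval_s=$2 desc=$3
    shift 3
    local deadline=$(( SECONDS + timeout_s ))
    until "$@"; do
        # The deadline is only evaluated after a failed attempt.
        if (( SECONDS >= deadline )); then
            echo "ERROR: timed out after ${timeout_s}s waiting for ${desc}" >&2
            return 1
        fi
        echo "Waiting for ${desc}..."
        sleep "$interval_s"
    done
    echo "${desc} is ready"
}

# Individual checks. PD_REST comes from the PR; STORE_REST is hypothetical.
pd_healthy()    { curl -fsS "http://${PD_REST}/v1/health" >/dev/null 2>&1; }
store_healthy() { curl -fsS "http://${STORE_REST}/v1/health" >/dev/null 2>&1; }
# Placeholder match: the real response shape of /v1/stores is an assumption.
has_up_store()  { curl -fsS "http://${PD_REST}/v1/stores" 2>/dev/null | grep -q '"Up"'; }

# Hard gate: every check must pass, otherwise the caller should exit.
pre_start_gate() {
    local t=${WAIT_STORAGE_TIMEOUT_S:-120}
    wait_until "$t" 5 "PD health" pd_healthy || return 1
    wait_until "$t" 5 "Store health" store_healthy || return 1
    wait_until "$t" 5 "an Up store in PD" has_up_store || return 1
}
```

The script would call `pre_start_gate || exit 1` before init-store. The same `wait_until` loop could probably wrap init-store for the retry step, while the post-start partition/leader observation would stay outside this gate and only log warnings.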
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]