CRZbulabula opened a new pull request, #17595: URL: https://github.com/apache/iotdb/pull/17595
## Summary

- **Adaptive topology probing interval**: Probing interval now scales with DataNode count (`max(base, base × N / refN)`), preventing fixed-interval overhead from dominating large clusters. Configurable via `topology_probing_base_interval_in_ms`, `topology_probing_reference_node_count`, and `topology_probing_timeout_ratio`.
- **√N sampling with batch rotation**: Each probing cycle selects only `ceil(√N)` DataNodes as probers (instead of all N), rotating across cycles for full coverage. Reduces per-cycle RPC fan-out from O(N) to O(√N) and total connection tests from O(N²) to O(N√N).
- **Independent topology push channel**: Topology is no longer piggybacked on the 1-second heartbeat. `TopologyService` pushes updates directly to DataNodes via a dedicated heartbeat RPC, only when a DataNode's reachable set changes. Eliminates the O(N²) per-heartbeat payload and the `compareAndSet`-based update-loss issue.
- **Timeout protection for test connection RPCs**: `submitInternalTestConnectionTask` and `submitTestConnectionTask` on DataNode now use bounded timeouts (`dnConnectionTimeoutInMS`) instead of unbounded `CountDownLatch.await()`. Prevents internal RPC threads from blocking indefinitely when target nodes are unreachable.
- **Isolated topology probing thread pool on DataNode**: The `submitInternalTestConnectionTask` handler offloads blocking work to a dedicated 2-thread `TOPOLOGY_PROBING_EXECUTOR` with `Future.get(timeout)`, so the `DataNodeInternalRPCService` thread pool (shared with `sendFragmentInstance`, consensus, etc.) is no longer held hostage by slow probes.
- **Increased heartbeat selector threads**: `AsyncDataNodeHeartbeatServiceClientPoolFactory` and `AsyncConfigNodeHeartbeatServiceClientPoolFactory` now use a dedicated `heartbeat_selector_num_of_client_manager` config (default 4, up from 1), preventing the single NIO selector thread from becoming a serial bottleneck when processing N heartbeat responses.
- **Fixed `collectPipeMetaList` lock timeout**: Changed from `dnConnectionTimeoutInMS * 2/3` (≈ 40000 **seconds** due to a unit mismatch, effectively about 11 hours) to 2 seconds. If the pipe meta lock cannot be acquired within 2s, the collection is skipped for this heartbeat cycle and retried at the next collection (every ~100 heartbeats).
- **Fixed topology diff logging bugs**: `LoadCache.updateTopology()` and `ClusterTopology.updateTopology()` both had copy-paste bugs where `originReachable` was read from `latestTopology` instead of the old topology, so the diff comparison was always equal and the change log was dead code.

## Test plan

- [ ] Verify `TopologyService` adaptive interval: deploy a 3-node cluster and confirm the probing interval equals `base_interval`; add nodes beyond `reference_node_count` and confirm the interval scales proportionally
- [ ] Verify √N sampling: with 9+ DataNodes, confirm via `[Topology]` log lines that only `ceil(√N)` probers are selected per cycle; confirm all nodes rotate through as probers over multiple cycles
- [ ] Verify topology push: confirm `[Topology] latest view from config-node` appears in DataNode logs only when the reachable set changes, not on every heartbeat
- [ ] Verify timeout protection: stop one DataNode and confirm the other DataNodes' internal RPC threads are not blocked beyond `dnConnectionTimeoutInMS`
- [ ] Verify `collectPipeMetaList` timeout: create a pipe and confirm the heartbeat handler does not block for more than 2 seconds, even under pipe lock contention
- [ ] Verify topology diff logging: trigger a network partition and confirm `[Topology] Topology of DataNode X is now unreachable to DataNode Y` log entries appear correctly
- [ ] Regression: run existing TopologyService and HeartbeatService integration tests

🤖 Generated with [Claude Code](https://claude.ai/claude-code)
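
For reviewers, the adaptive interval formula and the √N prober rotation from the Summary can be sketched roughly as follows. This is an illustration only, not the code in this PR: the class name, method names, and the rotating-window selection strategy are assumptions, and the numeric values in the usage note are hypothetical rather than the actual config defaults.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names are illustrative and not taken from the IoTDB code base.
final class TopologyProbingSketch {

  // Interval = max(base, base * N / refN), as stated in the Summary.
  static long probingIntervalMs(long baseIntervalMs, int referenceNodeCount, int dataNodeCount) {
    if (referenceNodeCount <= 0) {
      throw new IllegalArgumentException("referenceNodeCount must be positive");
    }
    long scaled = baseIntervalMs * dataNodeCount / referenceNodeCount;
    return Math.max(baseIntervalMs, scaled);
  }

  // Number of probers per cycle: ceil(sqrt(N)).
  static int proberCount(int dataNodeCount) {
    return (int) Math.ceil(Math.sqrt(dataNodeCount));
  }

  // Assumed rotation strategy: a window of ceil(sqrt(N)) ids that advances by
  // its own size each cycle and wraps around, so every node becomes a prober
  // after roughly sqrt(N) cycles.
  static List<Integer> selectProbers(List<Integer> dataNodeIds, long cycle) {
    List<Integer> probers = new ArrayList<>();
    int n = dataNodeIds.size();
    if (n == 0) {
      return probers;
    }
    int k = proberCount(n);
    int start = (int) Math.floorMod(cycle * (long) k, (long) n);
    for (int i = 0; i < k; i++) {
      probers.add(dataNodeIds.get((start + i) % n));
    }
    return probers;
  }
}
```

With a hypothetical base interval of 5000 ms and a reference node count of 10, a 3-node cluster keeps the 5000 ms interval while a 40-node cluster probes every 20000 ms. With 9 DataNodes, three consecutive cycles cover all nodes as probers.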
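
The timeout protection and isolated probing pool described in the Summary follow a common pattern: submit the blocking probe to a small dedicated executor and wait on it with `Future.get(timeout)`. A minimal sketch of that pattern, assuming a hypothetical helper `runWithTimeout` and a fallback value returned on timeout (neither is taken from the PR):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; class and method names are illustrative only.
final class BoundedProbeSketch {

  // Dedicated 2-thread pool so slow probes never occupy the caller's RPC pool.
  // Daemon threads keep this sketch from blocking JVM shutdown.
  private static final ExecutorService PROBING_EXECUTOR =
      Executors.newFixedThreadPool(
          2,
          r -> {
            Thread t = new Thread(r, "topology-probing-sketch");
            t.setDaemon(true);
            return t;
          });

  // Runs the probe on the dedicated pool and waits at most timeoutMs.
  // Returns the fallback if the probe times out, fails, or is interrupted.
  static <T> T runWithTimeout(Callable<T> probe, long timeoutMs, T fallback) {
    Future<T> future = PROBING_EXECUTOR.submit(probe);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the stuck probe so its thread is freed
      return fallback;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      future.cancel(true);
      return fallback;
    } catch (ExecutionException e) {
      return fallback;
    }
  }
}
```

The calling thread is held for at most `timeoutMs`, regardless of how long the target node takes to respond. A timed-out probe still occupies one of the two pool threads until it reacts to the interrupt.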
