CRZbulabula opened a new pull request, #17595:
URL: https://github.com/apache/iotdb/pull/17595

   ## Summary
   
   - **Adaptive topology probing interval**: Probing interval now scales with 
DataNode count (`max(base, base × N / refN)`), preventing fixed-interval 
overhead from dominating large clusters. Configurable via 
`topology_probing_base_interval_in_ms`, 
`topology_probing_reference_node_count`, and `topology_probing_timeout_ratio`.
   - **√N sampling with batch rotation**: Each probing cycle selects only 
`ceil(√N)` DataNodes as probers (instead of all N), rotating across cycles for 
full coverage. Reduces per-cycle RPC fan-out from O(N) to O(√N) and total 
connection tests from O(N²) to O(N√N).
   - **Independent topology push channel**: Topology is no longer piggybacked 
on the 1-second heartbeat. `TopologyService` pushes updates directly to 
DataNodes via a dedicated heartbeat RPC, and only when a DataNode's reachable 
set changes. This eliminates the O(N²) per-heartbeat payload and the 
`compareAndSet`-based update-loss issue.
   - **Timeout protection for test connection RPCs**: 
`submitInternalTestConnectionTask` and `submitTestConnectionTask` on DataNode 
now use bounded timeouts (`dnConnectionTimeoutInMS`) instead of unbounded 
`CountDownLatch.await()`. Prevents internal RPC threads from blocking 
indefinitely when target nodes are unreachable.
   - **Isolated topology probing thread pool on DataNode**: 
`submitInternalTestConnectionTask` handler offloads blocking work to a 
dedicated 2-thread `TOPOLOGY_PROBING_EXECUTOR` with `Future.get(timeout)`, so 
the `DataNodeInternalRPCService` thread pool (shared with 
`sendFragmentInstance`, consensus, etc.) is no longer held hostage by slow 
probes.
   - **Increased heartbeat selector threads**: 
`AsyncDataNodeHeartbeatServiceClientPoolFactory` and 
`AsyncConfigNodeHeartbeatServiceClientPoolFactory` now use a dedicated 
`heartbeat_selector_num_of_client_manager` config (default 4, up from 1), 
preventing the single NIO selector thread from becoming a serial bottleneck 
when processing N heartbeat responses.
   - **Fixed `collectPipeMetaList` lock timeout**: Changed from 
`dnConnectionTimeoutInMS * 2/3` (≈ 40000 **seconds** because of a unit 
mismatch, i.e. roughly 11 hours) to 2 seconds. If the pipe meta lock cannot be 
acquired within 2s, the collection is skipped for this heartbeat cycle and 
retried at the next collection attempt (roughly every 100 heartbeats).
   - **Fixed topology diff logging bugs**: `LoadCache.updateTopology()` and 
`ClusterTopology.updateTopology()` both had copy-paste bugs where 
`originReachable` read from `latestTopology` instead of the old topology, 
making the diff comparison always equal and the change-log dead code.
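
The arithmetic behind the first two items (adaptive interval and √N sampling with rotation) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class and method names are hypothetical, and the rotation scheme (a sliding window over an ordered list of DataNode IDs) is an assumption; the actual `TopologyService` may choose its batches differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of IoTDB.
final class TopologyProbingSketch {

  /** Interval = max(base, base * N / refN), in milliseconds. */
  static long probingIntervalMs(long baseMs, int nodeCount, int referenceNodeCount) {
    if (referenceNodeCount <= 0) {
      throw new IllegalArgumentException("referenceNodeCount must be positive");
    }
    return Math.max(baseMs, baseMs * nodeCount / referenceNodeCount);
  }

  /** Number of probers per cycle: ceil(sqrt(N)). */
  static int proberCount(int nodeCount) {
    return (int) Math.ceil(Math.sqrt(nodeCount));
  }

  /**
   * Picks the probers for one cycle by sliding a window of size ceil(sqrt(N))
   * over the ordered node list (assumed rotation scheme), wrapping at the end.
   */
  static List<Integer> selectProbers(List<Integer> nodeIds, long cycle) {
    List<Integer> probers = new ArrayList<>();
    int n = nodeIds.size();
    if (n == 0) {
      return probers;
    }
    int k = proberCount(n);
    int start = (int) ((cycle * k) % n);
    for (int i = 0; i < k; i++) {
      probers.add(nodeIds.get((start + i) % n));
    }
    return probers;
  }

  public static void main(String[] args) {
    List<Integer> ids = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      ids.add(i);
    }
    for (long cycle = 0; cycle < 3; cycle++) {
      System.out.println("cycle " + cycle + " probers: " + selectProbers(ids, cycle));
    }
    System.out.println("interval (ms): " + probingIntervalMs(1000, 20, 10));
  }
}
```

With this window scheme every node serves as a prober within `ceil(N / ceil(√N))` cycles, and each cycle issues about `√N × N` connection tests, which is where the O(N√N) figure comes from.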
   
   ## Test plan
   
   - [ ] Verify `TopologyService` adaptive interval: deploy a 3-node cluster, 
confirm probing interval equals `base_interval`; add nodes beyond 
`reference_node_count`, confirm interval scales proportionally
   - [ ] Verify √N sampling: with 9+ DataNodes, confirm only `ceil(√N)` probers 
are selected per cycle via `[Topology]` log lines; confirm all nodes rotate 
through as probers over multiple cycles
   - [ ] Verify topology push: confirm `[Topology] latest view from 
config-node` appears in DataNode logs only when the reachable set changes, not 
on every heartbeat
   - [ ] Verify timeout protection: stop one DataNode, confirm other DataNodes' 
internal RPC threads are not blocked beyond `dnConnectionTimeoutInMS`
   - [ ] Verify `collectPipeMetaList` timeout: create a pipe, confirm heartbeat 
handler does not block for more than 2 seconds even under pipe lock contention
   - [ ] Verify topology diff logging: trigger a network partition, confirm 
`[Topology] Topology of DataNode X is now unreachable to DataNode Y` log 
entries appear correctly
   - [ ] Regression: run existing TopologyService and HeartbeatService 
integration tests
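
For reviewers checking the timeout-protection item, the pattern under test (blocking probe work moved to a small dedicated pool, with the caller waiting via `Future.get(timeout)`) can be sketched as below. This is a simplified stand-in under stated assumptions, not the handler from this PR: `BoundedProbeSketch`, `probe`, and the thread name are hypothetical, and the connection test is modeled as an arbitrary `Callable<Boolean>`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, for illustration only; not part of IoTDB.
final class BoundedProbeSketch {

  // Two daemon threads, mirroring the pool size mentioned in the summary.
  private static final ExecutorService PROBING_EXECUTOR =
      Executors.newFixedThreadPool(
          2,
          r -> {
            Thread t = new Thread(r, "topology-probing-sketch");
            t.setDaemon(true);
            return t;
          });

  /** Runs the connection test off the caller's thread and waits at most timeoutMs. */
  static boolean probe(Callable<Boolean> connectionTest, long timeoutMs) {
    Future<Boolean> future = PROBING_EXECUTOR.submit(connectionTest);
    try {
      return Boolean.TRUE.equals(future.get(timeoutMs, TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the slow probe so the worker is freed
      return false;
    } catch (ExecutionException e) {
      return false; // the probe itself failed
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      future.cancel(true);
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("fast probe: " + probe(() -> true, 1000));
    System.out.println(
        "slow probe: "
            + probe(
                () -> {
                  Thread.sleep(5000); // simulates an unreachable target
                  return true;
                },
                100));
  }
}
```

The caller's wait is bounded by `timeoutMs` regardless of how long the probe takes, so a shared RPC thread is released even when the target node never answers.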
   
   🤖 Generated with [Claude Code](https://claude.ai/claude-code)

