[PR] Fix Interconnect High Retry [cloudberry]

via GitHub Sun, 21 Sep 2025 19:52:31 -0700


oracleloyall opened a new pull request, #1365:
URL: https://github.com/apache/cloudberry/pull/1365


   This PR addresses several problems seen under concurrent workloads:
   excessive motion retries, congestion, retransmission storms, and network
   skew. It targets inefficient retransmission handling in unreliable network
   environments. Specifically:
   
   Fixed Timeout Thresholds: Traditional TCP-style Retransmission Timeout
   (RTTVAR.RTO) calculations may be too rigid for networks with volatile
   latency (e.g., satellite links, wireless networks). This leads to:
   • Premature Retransmissions: Unnecessary data resends during temporary
     latency spikes, wasting bandwidth.
   • Delayed Recovery: Slow reaction to actual packet loss when RTO is overly
     conservative.
   
   Lack of Context Awareness: Static RTO ignores real-time network behavior 
patterns, reducing throughput and responsiveness.
   
   Solution: Dynamic Timeout Threshold Adjustment
   Implements an adaptive timeout mechanism to optimize retransmission:

       if (now < (curBuf->sentTime + conn->rttvar.rto)) {
           uint32_t diff = (curBuf->sentTime + conn->rttvar.rto) - now;
           // ... (statistical tracking and threshold adjustment)
       }
   
   Key Components:
   • Statistical Tracking:
     - min/max: Tracks observed minimum/maximum residual time (time left
       until RTO expiry).
     - retrans_count/no_retrans_count: Counts retransmission vs.
       non-retransmission events.
   
   • Weighted Threshold Calculation:
   unack_queue_ring.time_difference = (uint32_t)(
   unack_queue_ring.max * weight_no_retrans +
   unack_queue_ring.min * weight_retrans
   );
   Weights derived from historical ratios of retransmissions (weight_retrans) 
vs. successful deliveries (weight_no_retrans).
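
   The tracking and weighting described above can be sketched roughly as
   follows. This is an illustration, not the PR's code: the `ResidualStats`
   struct and the `residual_stats_*` helpers are hypothetical names, and the
   sketch assumes each weight is simply that event type's share of all
   recorded events, which the description does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the statistics named in the description. */
typedef struct ResidualStats
{
    uint32_t min;              /* smallest residual time seen (time left until RTO expiry) */
    uint32_t max;              /* largest residual time seen */
    uint32_t retrans_count;    /* events that ended in a retransmission */
    uint32_t no_retrans_count; /* events that did not */
    uint32_t time_difference;  /* resulting weighted threshold */
} ResidualStats;

static void
residual_stats_init(ResidualStats *s)
{
    s->min = UINT32_MAX;
    s->max = 0;
    s->retrans_count = 0;
    s->no_retrans_count = 0;
    s->time_difference = 0;
}

/* Record one observation: the residual time and whether a retransmit occurred. */
static void
residual_stats_record(ResidualStats *s, uint32_t residual, bool retransmitted)
{
    if (residual < s->min)
        s->min = residual;
    if (residual > s->max)
        s->max = residual;
    if (retransmitted)
        s->retrans_count++;
    else
        s->no_retrans_count++;
}

/* Assumed weighting: each weight is that event type's fraction of all events. */
static uint32_t
residual_stats_threshold(ResidualStats *s)
{
    uint32_t total = s->retrans_count + s->no_retrans_count;

    assert(total > 0); /* needs at least one recorded observation */

    double weight_retrans = (double) s->retrans_count / total;
    double weight_no_retrans = (double) s->no_retrans_count / total;

    s->time_difference = (uint32_t) (s->max * weight_no_retrans +
                                     s->min * weight_retrans);
    return s->time_difference;
}
```

   Under this assumption, a history with no retransmissions yields a threshold
   equal to max, and a history made up only of retransmissions yields min.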
   
   How It Solves the Problem:
   • Temporary Latency Spike: Uses max (conservative) to avoid false
     retransmits, reducing bandwidth waste (vs. traditional mistaken
     retransmissions).
   • Persistent Packet Loss: Prioritizes min (aggressive) via weight_retrans,
     accelerating recovery (vs. slow fixed-RTO reaction).
   • Stable Network: Balances weights for equilibrium throughput (vs. static
     RTO limitations).
   
   EstimateRTT - Dynamically estimates the Round-Trip Time (RTT) and adjusts 
Retransmission Timeout (RTO)
   
   This function implements a variant of the Jacobson/Karels algorithm for RTT 
estimation, adapted for UDP-based motion control connections. It updates 
smoothed RTT (srtt), mean deviation (mdev), and RTO values based on newly 
measured RTT samples (mrtt). The RTO calculation ensures reliable data 
transmission over unreliable networks.
   
   Key Components:
   
   * srtt:   Smoothed Round-Trip Time (weighted average of historical RTT 
samples)
   * mdev:   Mean Deviation (measure of RTT variability)
   * rttvar: Adaptive RTT variation bound (used to clamp RTO updates)
   * rto:    Retransmission Timeout (dynamically adjusted based on srtt + 
rttvar)
   
   Algorithm Details:
   
   1. For the first RTT sample:
      srtt    = mrtt << 3   (scaled by 8 for fixed-point arithmetic)
      mdev    = mrtt << 1   (scaled by 2)
      rttvar  = max(mdev, rto_min)
   2. For subsequent samples:
      Delta   = mrtt - (srtt >> 3)  (difference between new sample and smoothed 
RTT)
      srtt   += Delta               (update srtt with 1/8 weight of new sample)
      Delta   = abs(Delta) - (mdev >> 2)
      mdev   += Delta               (update mdev with 1/4 weight)
   3. rttvar bounds the maximum RTT variation:
      If mdev > mdev_max, update mdev_max and rttvar
      On new ACKs (snd_una > rtt_seq), decay rttvar toward mdev_max
   4. Final RTO calculation:
      rto = (srtt >> 3) + rttvar   (clamped to RTO_MAX)
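
   The four steps above can be sketched as a small, self-contained function.
   This is a sketch under stated assumptions, not the PR's EstimateRTT: the
   `RttState` struct, the `estimate_rtt_sketch` name, the `new_ack_window`
   flag (standing in for the snd_una > rtt_seq check), and both `SKETCH_RTO_*`
   constants are hypothetical. The guard on the rttvar update and the decay by
   one quarter of the gap are borrowed from the Linux kernel's
   tcp_rtt_estimator, which the description resembles; the PR's code may
   differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_RTO_MIN 2000    /* hypothetical floor, in microseconds */
#define SKETCH_RTO_MAX 500000  /* hypothetical ceiling, in microseconds */

typedef struct RttState
{
    int64_t srtt;     /* smoothed RTT, scaled by 8; 0 means "no sample yet" */
    int64_t mdev;     /* mean deviation, scaled by 4 */
    int64_t mdev_max; /* largest mdev seen in the current ACK window */
    int64_t rttvar;   /* variation bound added to srtt to form rto */
    int64_t rto;      /* resulting retransmission timeout */
} RttState;

static void
estimate_rtt_sketch(RttState *st, int64_t mrtt, bool new_ack_window)
{
    assert(mrtt >= 0);

    if (st->srtt == 0)
    {
        /* Step 1: first sample seeds the estimator. */
        st->srtt = mrtt << 3;
        st->mdev = mrtt << 1;
        st->rttvar = st->mdev > SKETCH_RTO_MIN ? st->mdev : SKETCH_RTO_MIN;
        st->mdev_max = st->rttvar;
    }
    else
    {
        /* Step 2: fold the new sample into srtt (1/8) and mdev (1/4). */
        int64_t delta = mrtt - (st->srtt >> 3);

        st->srtt += delta;
        if (delta < 0)
            delta = -delta;
        delta -= st->mdev >> 2;
        st->mdev += delta;

        /* Step 3: track the window maximum and let rttvar follow it. */
        if (st->mdev > st->mdev_max)
        {
            st->mdev_max = st->mdev;
            if (st->mdev_max > st->rttvar)
                st->rttvar = st->mdev_max;
        }
        if (new_ack_window)
        {
            if (st->mdev_max < st->rttvar)
                st->rttvar -= (st->rttvar - st->mdev_max) >> 2;
            st->mdev_max = SKETCH_RTO_MIN;
        }
    }

    if (st->srtt == 0)
        st->srtt = 1; /* keep 0 reserved for "no sample yet" */

    /* Step 4: final RTO, clamped to the ceiling. */
    st->rto = (st->srtt >> 3) + st->rttvar;
    if (st->rto > SKETCH_RTO_MAX)
        st->rto = SKETCH_RTO_MAX;
}
```

   With a steady 8000 µs sample, the first call gives rto = 8000 + 16000 =
   24000; later calls shrink mdev, and rttvar only follows it down when a new
   ACK window begins.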
   
   Network Latency Filtering and RTO Optimization
   
   This logic mitigates RTO distortion caused by non-network delays in database 
execution pipelines. Key challenges addressed:
   
   * Operator processing delays (non-I/O wait) inflate observed ACK times
   * Spurious latency amplification in lossy networks triggers excessive 
RTO_MAX waits
   * Congestion collapse from synchronized retransmissions
   
   Core Mechanisms:
   
   1. Valid RTT Sampling Filter:
      Condition: 4 * (pkt->recv_time - pkt->send_time) > ackTime &&
                 pkt->retry_times != Gp_interconnect_min_retries_before_timeout
      Rationale:

      * Filters packets exceeding 2x expected round-trips (4x one-way)
      * Excludes artificial retries
        (retry_times=Gp_interconnect_min_retries_before_timeout) to avoid
        sampling bias

      Action: Update RTT estimation only with valid samples via EstimateRTT()
   
   2. Randomized Backoff:
      Condition: buf->nRetry > 0
      Algorithm: rto += (rto >> (4 * buf->nRetry))
      Benefits:
   
      * Exponential decay: Shifts create geometrically decreasing increments
      * Connection-specific randomization: Prevents global synchronization
      * Dynamic scaling: Adapts to retry depth (nRetry)
   
   3. Timer List Management (NEW_TIMER):
      Operations:
      RemoveFromRTOList(&mudp, bufConn)   → Detaches from monitoring
      AddtoRTOList(&mudp, bufConn)        → Reinserts with updated rto
      Purpose: Maintains real-time ordering of expiration checks
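
   The sampling filter (mechanism 1) and the backoff formula (mechanism 2) can
   be written out as two small helpers. This is a minimal sketch: the
   `PktSample` struct and both function names are hypothetical, and the
   `min_retries` parameter stands in for
   Gp_interconnect_min_retries_before_timeout. The guard for nRetry >= 8 is an
   addition of this sketch, since shifting a 32-bit value by 32 or more bits
   is undefined in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the packet fields named in the condition. */
typedef struct PktSample
{
    uint64_t send_time;   /* when the packet was sent */
    uint64_t recv_time;   /* when it was received */
    int      retry_times; /* how many times it was retried */
} PktSample;

/*
 * Mechanism 1: accept a sample only if 4x the send-to-receive time exceeds
 * ackTime and the packet did not reach the retry limit.
 */
static bool
is_valid_rtt_sample(const PktSample *pkt, uint64_t ackTime, int min_retries)
{
    assert(pkt->recv_time >= pkt->send_time);

    return 4 * (pkt->recv_time - pkt->send_time) > ackTime &&
           pkt->retry_times != min_retries;
}

/*
 * Mechanism 2: rto += rto >> (4 * nRetry), applied only when nRetry > 0.
 */
static uint32_t
backoff_rto(uint32_t rto, unsigned nRetry)
{
    if (nRetry == 0)
        return rto; /* condition buf->nRetry > 0 not met */
    if (nRetry >= 8)
        return rto; /* shift would be >= 32 bits; the increment is zero */

    uint32_t increment = rto >> (4 * nRetry);

    assert(rto <= UINT32_MAX - increment); /* sketch assumes no wraparound */
    return rto + increment;
}
```

   For rto = 4096 the increment is 256 at nRetry = 1, 16 at nRetry = 2, and 1
   at nRetry = 3, which illustrates the geometrically decreasing increments.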
   
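   The remove-then-reinsert pattern of mechanism 3 can be illustrated with a
   list kept sorted by expiry time. This is only a sketch of the general
   technique: `TimerNode`, `TimerList`, and the `timer_list_*` functions are
   hypothetical and are not the PR's RemoveFromRTOList or AddtoRTOList, whose
   data structure is not shown in the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct TimerNode
{
    uint64_t expiry;        /* absolute time at which the RTO fires */
    struct TimerNode *prev;
    struct TimerNode *next;
} TimerNode;

typedef struct TimerList
{
    TimerNode *head;        /* earliest expiry first */
} TimerList;

/* Detach a node from the list; a node that is not linked is left alone. */
static void
timer_list_remove(TimerList *list, TimerNode *node)
{
    if (node->prev != NULL)
        node->prev->next = node->next;
    else if (list->head == node)
        list->head = node->next;
    if (node->next != NULL)
        node->next->prev = node->prev;
    node->prev = NULL;
    node->next = NULL;
}

/* Insert a detached node so the list stays ordered by expiry. */
static void
timer_list_add(TimerList *list, TimerNode *node)
{
    TimerNode *prev = NULL;
    TimerNode *cur = list->head;

    assert(node->prev == NULL && node->next == NULL);

    while (cur != NULL && cur->expiry <= node->expiry)
    {
        prev = cur;
        cur = cur->next;
    }
    node->prev = prev;
    node->next = cur;
    if (prev != NULL)
        prev->next = node;
    else
        list->head = node;
    if (cur != NULL)
        cur->prev = node;
}

/* Remove, update the expiry, and reinsert at the new position. */
static void
timer_list_reschedule(TimerList *list, TimerNode *node, uint64_t new_expiry)
{
    timer_list_remove(list, node);
    node->expiry = new_expiry;
    timer_list_add(list, node);
}
```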
   We ran multiple full-scale TPC-DS benchmarks on two setups: a single
   physical machine with 48 nodes, and four physical machines with 96 nodes,
   testing with MTU values of 1500 and 9000. In the single-machine
   environment, which has no network bottleneck, there were no significant
   performance differences between MTU 1500 and MTU 9000. In the 96-node
   environment, under single-threaded execution, there were no significant
   performance differences. However, under multi-threaded execution
   (4 threads), SQL statements with a high percentage of data movement showed
   significant performance differences, ranging from 5 to 10 times, especially
   with MTU 1500.
   
   <!-- Thank you for your contribution to Apache Cloudberry (Incubating)! -->
   
   Fixes #1065
   
   ### What does this PR do?
   <!-- Brief overview of the changes, including any major features or fixes -->
   This PR addresses several problems seen under concurrent workloads:
   excessive motion retries, congestion, retransmission storms, and network
   skew. It targets inefficient retransmission handling in unreliable network
   environments.
   ### Type of Change
   - [x] Bug fix (non-breaking change)
   - [x] New feature (non-breaking change)
   - [ ] Breaking change (fix or feature with breaking changes)
   - [ ] Documentation update
   
   ### Breaking Changes
   <!-- Remove if not applicable. If yes, explain impact and migration path -->
   
   ### Test Plan
   <!-- How did you test these changes? -->
   - [x] Unit tests added/updated
   - [x] Integration tests added/updated
   - [x] Passed `make installcheck`
   - [x] Passed `make -C src/test installcheck-cbdb-parallel`
   
   ### Impact
   <!-- Remove sections that don't apply -->
   **Performance:**
   <!-- Any performance implications? -->
   
   **User-facing changes:**
   <!-- Any changes visible to users? -->
   
   **Dependencies:**
   <!-- New dependencies or version changes? -->
   
   ### Checklist
   - [ ] Followed [contribution 
guide](https://cloudberry.apache.org/contribute/code)
   - [ ] Added/updated documentation
   - [ ] Reviewed code for security implications
   - [ ] Requested review from [cloudberry 
committers](https://github.com/orgs/apache/teams/cloudberry-committers)
   
   ### Additional Context
   <!-- Any other information that would help reviewers? Remove if none -->
   
   ### CI Skip Instructions
   <!--
   To skip CI builds, add the appropriate CI skip identifier to your PR title.
   The identifier must:
   - Be in square brackets []
   - Include the word "ci" and either "skip" or "no"
   - Only use for documentation-only changes or when absolutely necessary
   -->
   
   ---
   <!-- Join our community:
   - Mailing list: 
[d...@cloudberry.apache.org](https://lists.apache.org/list.html?d...@cloudberry.apache.org)
 (subscribe: dev-subscr...@cloudberry.apache.org)
   - Discussions: https://github.com/apache/cloudberry/discussions -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cloudberry.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cloudberry.apache.org
For additional commands, e-mail: commits-h...@cloudberry.apache.org
