>>> For example in an ideal world putting more or less DB servers
>>> in the client 'CellServDB' should not matter, as long as one
>>> that belongs to the cell is up; again if the logic were for
>>> all types of client: "scan quickly the list of potential DB
>>> servers, find one that is up and belongs to the cell and
>>> reckons is part of the quorum, and if necessary get from it
>>> the address of the sync site".
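
To make the quoted logic concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the `probe` callback, the `ProbeReply` fields and the `example.org` server names are my own placeholders, not the OpenAFS client code or any Ubik API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class ProbeReply:
    """Hypothetical answer from a DB server to a single liveness probe."""
    cell: str        # cell the server says it belongs to
    in_quorum: bool  # whether the server reckons it is part of the quorum
    sync_site: str   # address of the sync site as known to this server


def find_sync_site(
    servers: Iterable[str],
    cell: str,
    probe: Callable[[str], Optional[ProbeReply]],
) -> Optional[str]:
    """Return the sync site reported by the first usable server, else None.

    `probe` is assumed to send one request with a short timeout and to
    return None when the server does not answer in time.
    """
    for address in servers:
        reply = probe(address)
        if reply is None:
            continue  # no answer: move on to the next candidate
        if reply.cell != cell or not reply.in_quorum:
            continue  # up, but wrong cell or not in the quorum
        return reply.sync_site
    return None


# Simulated cell: db1 is down, db2 belongs to another cell, db3 is usable.
_FAKE_REPLIES = {
    "db2.example.org": ProbeReply("other.example.org", True, "db2.example.org"),
    "db3.example.org": ProbeReply("example.org", True, "db4.example.org"),
}


def fake_probe(address: str) -> Optional[ProbeReply]:
    """Stand-in for a real network probe; unknown servers never answer."""
    return _FAKE_REPLIES.get(address)


servers = ["db1.example.org", "db2.example.org", "db3.example.org"]
result = find_sync_site(servers, "example.org", fake_probe)
print(result)
```

The point of the sketch is only that the number of entries in the list does not matter much as long as each unanswered probe costs little time.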
> The problem is that you want the client to scan "quickly" to find a
> server that is up, but because networks are not perfectly
> reliable and drop packets all the time, it cannot know that a
> server is not up until that server has failed to respond to
> multiple retransmissions of the request.

That has nothing to do with how quickly the probes are sent...

> Those retransmissions cannot be sent "quickly"; in fact, they
> _must_ be sent with exponentially-increasing backoff times.

That has nothing to do with how "quickly" they can be sent... The
duration of the intervals between the probes is a different matter
from what the ratio between successive intervals should be.

> Otherwise, when your network becomes congested, the
> retransmission of dropped packets will act as a runaway positive
> feedback loop, making the congestion worse and saturating the
> network.

I am sorry I have not been clear about the topic: I did not mean to
discuss flow control in back-to-back streaming connections; my
concern was the frequency of *probing* servers for accessibility.

Discovering the availability of DB servers is not the same thing as
streaming data from/to a fileserver, both in nature and as to the
amount of traffic involved. In TCP congestion control, for example,
one could be talking about streams of 100,000 packets of 8192B per
second.

DB database discovery

But even if I had meant to discuss congestion control for
back-to-back packet streams, the absolute numbers are still vastly
different. In the case of *probing* for the liveness of a *single*
DB server I have observed the 'vos' command send packets with these
intervals:

  «The wait times after the 8 attempts are: 3.6s, 6.8s, 13.2s,
  21.4s, 4.6s, 25.4s, 26.2s, 3.8s.»

with randomish variations around that. That's around 5 packets per
minute, with intervals between 3,600ms and 26,200ms. Again, to a
single DB server, not, say, round-robin to all DB servers in
'CellServDB'.
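
The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch: the wait times are the ones quoted above, and the 200ms constant is the TCP RTOmin default discussed below, used here purely as a yardstick.

```python
# Wait times (seconds) observed after each of the 8 'vos' probe attempts.
waits_s = [3.6, 6.8, 13.2, 21.4, 4.6, 25.4, 26.2, 3.8]

total_s = sum(waits_s)                           # time spent over all attempts
probes_per_minute = len(waits_s) / total_s * 60  # average probe rate
mean_wait_ms = total_s / len(waits_s) * 1000     # average interval

# Yardstick only: the 200ms minimum TCP retransmission timeout.
TCP_RTO_MIN_MS = 200
ratio_to_rto = min(waits_s) * 1000 / TCP_RTO_MIN_MS

print(f"total wait:        {total_s:.1f}s")
print(f"probes per minute: {probes_per_minute:.1f}")
print(f"mean interval:     {mean_wait_ms:.0f}ms")
print(f"shortest wait is {ratio_to_rto:.0f}x the TCP RTOmin")
```

That is 105 seconds for 8 probes, so about 4.6 probes per minute, and even the shortest wait (3.6s) is 18 times the 200ms TCP RTOmin.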

With TCP congestion control the minimum backoff (the 'RTO'
parameter) is 200ms (two hundred milliseconds). With another, rather
different distributed filesystem, Lustre, I observed some issues
with that very long backoff time on high throughput (600-800MB/s)
back-to-back packet streams, and there is a significant amount of
research arguing that on fast low-latency links a 200ms RTO seems
way excessive. For example in a paper that is already 5 years old:

  http://www.cs.cmu.edu/~dga/papers/incast-sigcomm2009.pdf

  «Under severe packet loss, TCP can experience a timeout that lasts
  a minimum of 200ms, determined by the TCP minimum retransmission
  timeout (RTOmin). While the default values operating systems use
  today may suffice for the wide-area, datacenters and SANs have
  round trip times that are orders of magnitude below the RTOmin
  defaults (Table 1).

    Scenario    RTT      OS       TCP RTOmin
    WAN         100ms    Linux    200ms
    Datacenter  <1ms     BSD      200ms
    SAN         <0.1ms   Solaris  400ms

    Table 1: Typical round-trip-times and minimum TCP
    retransmission bounds.»

  «FINE-GRAINED RTO
  How low must the RTO be to retain high throughput under TCP incast
  collapse conditions, and to how many servers does this solution
  scale? We explore this question using real-world measurements and
  ns-2 simulations [26], finding that to be maximally effective, the
  timers must operate on a granularity close to the RTT of the
  network - hundreds of microseconds or less.»

  «Figure 3: Experiments on a real cluster validate the simulation
  result that reducing the RTOmin to microseconds improves goodput.»

  «Aggressively lowering both the RTO and RTOmin shows practical
  benefits for datacenters. In this section, we investigate if
  reducing the RTOmin value to microseconds and using finer
  granularity timers is safe for wide area transfers.
  We find that the impact of spurious timeouts on long, bulk data
  flows is very low - within the margins of error - allowing RTO to
  go into the microseconds without impairing wide-area performance.»
_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info