Peter

To solve this you can't just use the round trip in its raw form; you need
to understand it in terms of how the "delay and loss" accrued.

It's a bit too long (and potentially off-topic) for this list, but briefly the
way we perform this sort of analysis (in my day job) is to view it as quality
attenuation (shorthand ∆Q) - this can be split into a set of bases that
permit evaluation of both the structural component of the delay/loss
(geography, path length and their serialisation rates) and the variable part
(which is basically due to contention).
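
To make that concrete, here is a minimal sketch (Python, made-up delay
samples, and function names that are purely illustrative - this is not any
real ∆Q toolchain): treat the observed floor of the delay as the structural
part (geography, path length, serialisation) and whatever sits above it as
the variable, contention-driven part.

import statistics

def split_delta_q(delay_samples_ms):
    """Split observed per-packet delays into a rough structural part
    (approximated here by the observed floor) and a variable part
    (what sits above that floor, i.e. contention). Illustration only -
    the real ∆Q treatment also carries loss explicitly."""
    structural = min(delay_samples_ms)   # best case ~ fixed cost of the path
    variable = [d - structural for d in delay_samples_ms]
    return {
        "structural_ms": structural,
        "variable_mean_ms": statistics.mean(variable),
        "variable_p99_ms": sorted(variable)[int(0.99 * (len(variable) - 1))],
    }

# Hypothetical RTT samples (ms) for one path; loss would be tracked separately.
samples = [3.1, 3.2, 3.1, 4.8, 3.3, 9.7, 3.2, 3.4, 6.1, 3.2]
print(split_delta_q(samples))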

This, in turn, can be used to estimate the setting of a timeout - if the
congestion has ONLY introduced delay, any additional sending will just
increase it. If the congestion has introduced loss (or loss has occurred for
another reason) then it is worth resending. Using the ∆Q approach it is
possible to assess both the false-positive hit (resending a packet when it
was just delayed) and the performance hit (waiting "too long" to retransmit
given a packet has been lost) for a protocol.
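
As a sketch of what that assessment looks like (again Python, made-up
samples, hypothetical names): given the delay distribution of packets that
did arrive, any candidate timeout trades a false-positive rate (packets that
were merely late but would have been resent) against how long a genuinely
lost packet waits before retransmission.

def assess_timeout(delay_samples_ms, timeout_ms):
    """For a candidate retransmit timeout, estimate the false-positive rate
    (fraction of delivered packets that arrived later than the timeout and
    would have been resent needlessly) and the recovery wait a genuinely
    lost packet pays, which is the timeout itself. Illustration only."""
    late = sum(1 for d in delay_samples_ms if d > timeout_ms)
    return {
        "timeout_ms": timeout_ms,
        "false_positive_rate": late / len(delay_samples_ms),
        "loss_recovery_wait_ms": timeout_ms,
    }

# Delays (ms) of packets that were actually delivered - illustrative only.
samples = [3.1, 3.2, 3.1, 4.8, 3.3, 9.7, 3.2, 3.4, 6.1, 3.2]
for t in (5.0, 10.0, 20.0):
    print(assess_timeout(samples, t))

Choosing the timeout then comes down to picking an acceptable point on that
trade-off.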

If you combine this approach with the collective knowledge of other
connections to/from that service (which can happen in Rx - at least in
principle, and to some extent in practice - but not really in TCP) then it
suggests that you could create a closer-to-optimal solution - one that
recovers from loss reasonably quickly without creating too much additional
load.
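
A toy illustration of that pooling (Python again; the class, endpoint name
and thresholds are all hypothetical, and this is not how Rx actually shares
state): connections to the same service feed their delay observations into a
shared registry, and a new call picks its initial timeout from that shared
history rather than from a conservative default.

from collections import defaultdict

class ServiceDelayRegistry:
    """Toy shared store of delay observations per service endpoint, so a new
    connection can pick an initial timeout informed by its peers rather than
    a conservative default. Purely illustrative."""
    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, service, delay_ms):
        self._samples[service].append(delay_ms)

    def suggested_timeout(self, service, default_ms=200.0, quantile=0.99):
        samples = sorted(self._samples[service])
        if len(samples) < 20:   # too little shared history: stay conservative
            return default_ms
        return samples[int(quantile * (len(samples) - 1))]

registry = ServiceDelayRegistry()
for d in (3.1, 3.4, 3.2, 5.9, 3.3) * 5:   # samples pooled from several connections
    registry.record("afsdb1:7003", d)
print(registry.suggested_timeout("afsdb1:7003"))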

Neil

If interested, there is a brief introduction to these ideas in 
http://www.slideshare.net/mgeddes/advanced-network-performance-measurement 

On 23 Jan 2014, at 21:55, Peter Grandi <p...@afs.list.sabi.co.uk> wrote:

>>>> For example in an ideal world putting more or less DB servers
>>>> in the client 'CellServDB' should not matter, as long as one
>>>> that belongs to the cell is up; again if the logic were for
>>>> all types of client: "scan quickly the list of potential DB
>>>> servers, find one that is up and belongs to the cell and
>>>> reckons is part of the quorum, and if necessary get from it
>>>> the address of the sync site".
> 
>> The problem is that you want the client to scan "quickly" to find a
>> server that is up, but because networks are not perfectly
>> reliable and drop packets all the time, it cannot know that a
>> server is not up until that server has failed to respond to
>> multiple retransmissions of the request.
> 
> That has nothing to do with how quickly the probes are sent...
> 
>> Those retransmissions cannot be sent "quickly"; in fact, they
>> _must_ be sent with exponentially-increasing backoff times.
> 
> That has nothing to do with how "quickly" they can be sent...  The
> duration of the intervals between the probes is a different matter
> from what the ratio of the intervals should be.
> 
>> Otherwise, when your network becomes congested, the
>> retransmission of dropped packets will act as a runaway positive
>> feedback loop, making the congestion worse and saturating the
>> network.
> 
> I am sorry I have not been clear about the topic: I was not
> meaning to discuss flow control in back-to-back streaming
> connections; my concern was about the frequency of *probing*
> servers for accessibility.
> 
> Discovering the availability of DB servers is not the same thing
> as streaming data from/to a fileserver, both in nature and as to
> amount of traffic involved. In TCP congestion control for example
> one could be talking about streams of 100,000x 8192B packets per
> second. DB server discovery involves nothing like that volume of traffic.
> 
> But even if I had meant to discuss back-to-back streaming packet
> congestion control, the absolute numbers are still vastly
> different. In the case of *probing* for the liveness of a *single*
> DB server I have observed the 'vos' command send packets with
> these intervals:
> 
>  «The wait times after the 8 attempts are: 3.6s, 6.8s, 13.2s,
>  21.4s, 4.6s, 25.4s, 26.2s, 3.8s.»
> 
> with randomish variations around that. That's around 5 packets per
> minute, with intervals between 3,600ms and 26,200ms. Again, to a
> single DB server, not, say, round-robin to all DB servers in
> 'CellServDB'.
> 
> With TCP congestion control the backoff (the 'RTO' parameter) is
> 200ms (two hundred milliseconds). With another, rather different,
> distributed filesystem, Lustre, I observed some issues with that
> very long backoff time on high-throughput (600-800MB/s)
> back-to-back packet streams, and there is a significant amount of
> research showing that on fast, low-latency links a 200ms RTO seems
> way excessive.
> 
> For example in a paper that is already 5 years old:
> 
>  http://www.cs.cmu.edu/~dga/papers/incast-sigcomm2009.pdf
> 
>    «Under severe packet loss, TCP can experience a timeout that
>    lasts a minimum of 200ms, determined by the TCP minimum
>    retransmission timeout (RTOmin).
> 
>    While the default values operating systems use today may
>    suffice for the wide-area, datacenters and SANs have round
>    trip times that are orders of magnitude below the RTOmin
>    defaults (Table 1).
> 
>    Scenario     RTT      OS       TCP RTOmin
>    WAN          100ms    Linux    200ms
>    Datacenter   <1ms     BSD      200ms
>    SAN          <0.1ms   Solaris  400ms
> 
>    Table 1: Typical round-trip-times and minimum
>    TCP retransmission bounds.»
> 
>    «FINE-GRAINED RTO
> 
>    How low must the RTO be to retain high throughput under TCP
>    incast collapse conditions, and to how many servers does this
>    solution scale? We explore this question using real-world
>    measurements and ns-2 simulations [26], finding that to be
>    maximally effective, the timers must operate on a granularity
>    close to the RTT of the network—hundreds of microseconds or
>    less.»
> 
>    «Figure 3: Experiments on a real cluster validate the
>    simulation result that reducing the RTOmin to microseconds
>    improves goodput.»
> 
>    «Aggressively lowering both the RTO and RTOmin shows practical
>    benefits for datacenters. In this section, we investigate if
>    reducing the RTOmin value to microseconds and using finer
>    granularity timers is safe for wide area transfers.
> 
>    We find that the impact of spurious timeouts on long, bulk
>    data flows is very low – within the margins of error –
>    allowing RTO to go into the microseconds without impairing
>    wide-area performance.»

_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
