Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

Vanbever Laurent Thu, 08 Jul 2021 06:14:21 -0700

> On 8 Jul 2021, at 14:29, Saku Ytti <[email protected]> wrote:
> 
> On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent <[email protected]> wrote:
> 
>> Detecting whole-link and node failures is relatively easy nowadays (e.g., 
>> using BFD). But what about detecting gray failures that only affect a 
>> *subset* of the traffic, e.g. a router randomly dropping 0.1% of the 
>> packets? Does your network often experience these gray failures? Are they 
>> problematic? Do you care? And can we (network researchers) do anything about 
>> it?
> 
> Network experiences gray failures all the time, and I almost never
> care, unless a customer does. If there is a network which does not
> experience these, then it's likely due to lack of visibility rather
> than issues not existing.
> 
> Fixing these can take months of working with vendors and attempts to
> remedy will usually cause planned or unplanned outages. So it rarely
> makes sense to try to fix as they usually impact a trivial amount of
> traffic.


Thanks for chiming in. That's also my feeling: a *lot* of gray failures 
routinely happen, a small percentage of which end up being really damaging (the 
ones hitting customer traffic, as you pointed out). For this small percentage 
though, I can imagine that being able to detect and locate them rapidly (i.e.
before the customer submits a ticket) would be interesting? Even if fixing the
root cause might take months (since it is up to the vendors), one could still
hope to remediate the situation temporarily by rerouting traffic, combined
with the traditional rebooting of the affected resources?

> Networks also routinely mangle packets in-memory which are not visible
> to FCS check.

Added to the list... Thanks!

Best,
Laurent
