> Saku Ytti
> Sent: Tuesday, November 17, 2020 6:55 AM
>
> On Tue, 17 Nov 2020 at 03:40, Sabri Berisha <sa...@cluecentral.net> wrote:
>
> Hey Sabri,
>
> > Also, in the case that I described it wasn't a Junos device. Makes me
> > wonder how bugs like that get introduced. One would expect that after
> > 20+ years of writing BGP code, handling a withdrawal would be easy-peasy.
>
> I don't think this is related to skill, that there was some hard programming
> problem that DE couldn't solve. These are honest mistakes.
> In my tenure I have not seen the frequency of these bugs change at all;
> NOS bugs are as common now as they were in the 90s.
>
> I put most of the blame on the market. We've modelled the commercial router
> market so that a poor-quality NOS is good for business and a good-quality NOS
> is bad for business. I don't think this is in anyone's formal business plan,
> or that companies even realise they are not even trying to make a good NOS. I
> think it's emergent behaviour due to the market, and people follow that market
> demand unknowingly.
> If we suddenly had one commercial NOS that was 100% bug free, many of its
> customers would stop buying support and would rely on spare HW and Internet
> forums for configuration help. A lot of us only need contracts to deal with
> the novel bugs all of us find on a regular basis, so a good NOS would
> immediately reduce revenue. For some reason Windows, macOS or Linux almost
> never have novel bugs that the end user finds, and when those are found, it's
> big news. Meanwhile we don't go a month without hitting a novel bug in one of
> our NOS, and no one cares about it; it's business as usual.
>
> I also put a lot of blame on C. It was a terrific language when compiling had
> to be fast; basically a macro assembler. Now the utility of being 'close to
> HW' is gone, as the CPU does so much that the C compiler has no control over;
> it's not really even executing the same code as written anymore. MSFT
> estimated >70% of their bugs are related to memory safety. We could accomplish
> significant improvements in software quality if we'd ditch C, allow the
> computer to do more formal correctness checks at compile time, and design
> languages which lend themselves to this.
>
> We constantly misattribute problems (like in this post) to config or HW, while
> the most common reasons for outages are pilot error and SW defects, and very
> little engineering time is spent on those. And often the time spent improving
> the first two increases the risk of the latter two, reducing mean availability
> over time.
>
I agree with everything but the last statement.
From my experience, most SPs spend considerable time testing for SW
defects on the features (and combinations of features) that will be used, at
the intended scale; that's how you identify most of the bugs. What you're left
with afterwards are special packets of death or some slow memory leaks
(basically the more exotic stuff).
adam