Speaking as a co-author from a network operator:

After the discussion, we have reached the common point listed in my previous 
mail. I repeat it here.
Production networks running OSPF DO have problems due to software 
implementation bugs or hardware defects. Those production network problems 
deserve proposals both to identify the router with bugs and to mitigate 
the problem, for example to reduce the impact of OSPF route flapping.
We are responsible for the network to run robustly, not for the router with 
bugs.

So, I support adoption of this document.



lizhenqi...@chinamobile.com
 
From: Acee Lindem (acee)
Date: 2016-10-12 02:51
To: OSPF WG List
Subject: [OSPF] FW: Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
Speaking as WG Co-Chair:

We had quite a lengthy discussion on this problem and whether or not it is 
something the WG should adopt. Please indicate whether or not you would support 
WG adoption before Oct 26th, 2016.

Thanks,
Acee 

From: "lizhenqi...@chinamobile.com" <lizhenqi...@chinamobile.com>
Date: Thursday, August 25, 2016 at 9:29 PM
To: Acee Lindem <a...@cisco.com>, Jie Dong <jie.d...@huawei.com>, "Les Ginsberg 
(ginsberg)" <ginsb...@cisco.com>, OSPF WG List <ospf@ietf.org>
Cc: "Zhangxudong (zhangxudong, VRP)" <zhangxud...@huawei.com>
Subject: Re: Re: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement

Hi Acee,

Totally agree with you that we have to avoid significant modification to OSPF. 

The common point after the mail discussion is that production networks running 
OSPF DO have problems due to software implementation bugs or hardware 
defects. Those production network problems deserve proposals both to 
identify the router with bugs and to mitigate the problem, for example to 
reduce the impact of OSPF route flapping.

Your suggestion is one option for identifying the defective router. Thank you 
very much.

Best Regards,


lizhenqi...@chinamobile.com
 
From: Acee Lindem (acee)
Date: 2016-08-25 03:04
To: lizhenqi...@chinamobile.com; Dongjie (Jimmy); Les Ginsberg (ginsberg); 
ospf@ietf.org
CC: Zhangxudong (zhangxudong, VRP)
Subject: Re: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
Speaking as WG member:

Hi Zhenjiang,

I don’t doubt that this was a very disquieting experience. However, I still 
don’t think we should attempt to change the protocol to compensate for routers 
that do not adhere to the protocol. To make an analogy, in my years of OSPF 
experience I’ve been subject to a number of bugs related to OSPF’s usage of 
local wire multicast (some triggered by obscure conditions such as routing and 
bridging on the same port). However, I’ve never proposed to not use local wire 
multicast. Also, after 25 years of OSPFv2, it doesn’t make sense to try and 
change the protocol to avoid bugs in this area. As for identifying the 
nefarious router, I think adding a counter and possibly a separate notification 
to the YANG model might be warranted since purging a non-self-originated LSA 
should not be a common occurrence in most networks. 

Thanks, 
Acee
P.S. Since this is an OSPF standards list, I’ve purposely avoided the questions 
as to how this catastrophic bug made it into a production network. 


From: "lizhenqi...@chinamobile.com" <lizhenqi...@chinamobile.com>
Date: Wednesday, August 24, 2016 at 2:11 PM
To: Jie Dong <jie.d...@huawei.com>, Acee Lindem <a...@cisco.com>, "Les Ginsberg 
(ginsberg)" <ginsb...@cisco.com>, OSPF WG List <ospf@ietf.org>
Cc: "Zhangxudong (zhangxudong, VRP)" <zhangxud...@huawei.com>
Subject: Re: RE: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement

Hello Jie, Acee and Les,

I am a coauthor of this draft from the operator China Mobile. Thank you all for 
your discussion and suggestions in the previous mails. As you all discussed, a 
misbehaving OSPF router (due to a software or hardware problem) can cause severe 
problems across the whole OSPF domain.

Here I want to point out that OSPF route flapping DID occur in my field 
network, caused by a misbehaving OSPF router installed there. The procedure to 
analyze and look for the cause was very complicated because we did not know 
the source of the flushing. After two hours, we still could not identify the 
real cause or restore our network. The CPU utilization of the OSPF routers was 
high, the network traffic decreased significantly, and many tunnel-down warnings 
were raised. When we tried shutting down one OSPF router, the route flapping 
stopped. This router was a newly deployed one. In communication with our 
vendor, they admitted that this product had some defects in handling the OSPF 
protocol. Defects of this kind are difficult for us to catch in testing when 
products apply for admission to our network. Once defective products are 
deployed in the field network, locating the problem is very hard and time 
consuming.

So, I think it is necessary for us to solve the problem and improve the 
robustness of the protocol. At least it should provide the means to help us 
locate the OSPF route flapping problem.



lizhenqi...@chinamobile.com
 
From: Dongjie (Jimmy)
Date: 2016-08-18 17:09
To: Acee Lindem (acee); Les Ginsberg (ginsberg); ospf@ietf.org
CC: Zhangxudong (zhangxudong, VRP); lizhenqi...@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
Hi Acee, 
 
Please see my replies inline:
 
From: Acee Lindem (acee) [mailto:a...@cisco.com] 
Sent: Thursday, August 18, 2016 2:23 AM
To: Dongjie (Jimmy); Les Ginsberg (ginsberg); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqi...@chinamobile.com
Subject: Re: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
 
Speaking as a WG member who has some experience with OSPF implementations: 
 
Hi Jie, 
 
Along with Les, I’m also against progressing this draft. 
 
From: Jie Dong <jie.d...@huawei.com>
Date: Tuesday, August 16, 2016 at 9:56 AM
To: Acee Lindem <a...@cisco.com>, "Les Ginsberg (ginsberg)" 
<ginsb...@cisco.com>, OSPF WG List <ospf@ietf.org>
Cc: "Zhangxudong (zhangxudong, VRP)" <zhangxud...@huawei.com>, 
"lizhenqi...@chinamobile.com" <lizhenqi...@chinamobile.com>
Subject: RE: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
 
Hi Acee, 
 
Thanks a lot for your feedback.
 
For packet corruption which impacts the LS age before the LSAs are packed into 
an LSU packet, I agree it is less likely to happen than the other cases. 
However, I think we agree that OSPF authentication only protects against 
packet-level corruption and cannot help to detect corruption at the LSA level.
 
So, you are suggesting that LSAs are corrupted in the database in such a way 
that the LSA Age is set exactly to 0xE10? How would the implementation know 
that this had happened and prematurely age the packet? Database aging just 
doesn’t work this way (unless the implementation is particularly naïve). 
 
[Jie] Actually, the case is this: when the LSA is about to be exchanged with a 
neighbor, the LS age is corrupted during message packing to either MaxAge or a 
large number close to MaxAge. The sending router does not intend to do a MaxAge 
flush, but the neighbor routers that receive the message would treat it as a 
flush. This is a possible case, although less likely to happen than the 
other cases.
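
The following is a minimal sketch of why a receiver would accept such a copy as a flush, assuming the instance comparison rules of RFC 2328 Section 13.1 (sequence number, then checksum, then MaxAge, then MaxAgeDiff). The class and function names are hypothetical and the field values are made up for illustration; this is not taken from any implementation discussed in the thread.

```python
from dataclasses import dataclass

MAX_AGE = 3600       # seconds (0xE10), RFC 2328 architectural constant
MAX_AGE_DIFF = 900   # seconds, RFC 2328 architectural constant


@dataclass(frozen=True)
class LsaInstance:
    """Hypothetical container for the three header fields used in the comparison."""
    seq: int       # LS sequence number as carried on the wire (32 bits)
    checksum: int  # LS checksum
    age: int       # LS age in seconds


def signed32(value: int) -> int:
    """Interpret a 32-bit wire value as a signed integer, as sequence numbers are."""
    return value - (1 << 32) if value & 0x80000000 else value


def compare_instances(a: LsaInstance, b: LsaInstance) -> int:
    """Return 1 if a is more recent, -1 if b is, 0 if they are considered identical."""
    if a.seq != b.seq:
        return 1 if signed32(a.seq) > signed32(b.seq) else -1
    if a.checksum != b.checksum:
        return 1 if a.checksum > b.checksum else -1
    # Exactly one copy at MaxAge: that copy is treated as more recent.
    if (a.age == MAX_AGE) != (b.age == MAX_AGE):
        return 1 if a.age == MAX_AGE else -1
    # Ages far apart: the younger copy is treated as more recent.
    if abs(a.age - b.age) > MAX_AGE_DIFF:
        return 1 if a.age < b.age else -1
    return 0


# Same sequence number and checksum; only the age was altered to MaxAge.
in_database = LsaInstance(seq=0x80000005, checksum=0x1A2B, age=120)
corrupted = LsaInstance(seq=0x80000005, checksum=0x1A2B, age=MAX_AGE)
result = compare_instances(corrupted, in_database)
print(result)
```

Under these assumptions the MaxAge copy wins the comparison, so the receiver would install it and flood it onward even though the originator never intended a flush.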
 
 
In my understanding, robustness is an important feature of network protocols, 
which includes robustness to errors and failures that happen in the network. If 
there is a bug in a particular router in the network, the operator would not 
accept the whole network being impacted, which means the other routers in the 
network need to work properly in this situation. For example, in BGP the error 
handling mechanism has been optimized to avoid unnecessary session teardown.
 
So you agree your problem statement is confined to a software bug resulting in 
LSAs being aged too quickly? I think this is the third time I’ve raised this 
question. 
 
[Jie] As I said before, the problems that occurred in the production network 
were caused by a software bug in LSA aging, so I think this is the major case. 
 
If it has such a problem (whether it be due to a system timer bug or some 
more specific aging problem), it seems the router would also be refreshing its 
LSAs all too frequently (at least at twice the rate) and it would be readily 
identifiable. For a system time problem, the router would likely have many 
other problems. For example, it would not maintain OSPF adjacencies if the dead 
timer advances fast enough. It would retransmit at a very fast rate as well. 
Are you going to write problem statements and suggest solutions for these 
situations as well? 
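
As a rough worked example of the timing argument above, the sketch below assumes a single system timer running `speedup` times faster than real time. LSRefreshTime (1800 s) and MaxAge (3600 s) are RFC 2328 architectural constants; the 10 s Hello interval and 40 s dead interval are common default values assumed here, not figures from the thread, and the function names are hypothetical.

```python
LS_REFRESH_TIME = 1800      # seconds, RFC 2328 architectural constant
MAX_AGE = 3600              # seconds, RFC 2328 architectural constant
HELLO_INTERVAL = 10         # seconds, assumed typical default
ROUTER_DEAD_INTERVAL = 40   # seconds, assumed typical default


def wall_clock_seconds(timer_seconds: float, speedup: float) -> float:
    """Real elapsed time before a timer fires on a clock running `speedup` times fast."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return timer_seconds / speedup


def adjacency_survives(speedup: float) -> bool:
    """Simplified model: the faulty router's dead timer must outlast the
    neighbor's real-time Hello interval, or the adjacency is declared down."""
    return wall_clock_seconds(ROUTER_DEAD_INTERVAL, speedup) > HELLO_INTERVAL


for speedup in (1, 2, 5):
    print(
        speedup,
        wall_clock_seconds(LS_REFRESH_TIME, speedup),  # self-refresh period
        wall_clock_seconds(MAX_AGE, speedup),          # time until foreign LSAs hit MaxAge
        adjacency_survives(speedup),
    )
```

In this simplified model a clock running twice as fast refreshes its own LSAs every 900 real seconds and ages out other routers' LSAs after 1800, while adjacencies only start to fail once the speedup reaches four.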
 
[Jie] This depends on the implementation. The software bug may only impact the 
aging of LSAs received from other routers. And frequent LSA refreshing may be 
caused by other things, such as link oscillation.  For a system timer problem, 
the OSPF adjacency may oscillate, but if the management connection is impacted, 
such oscillation is difficult to identify. 
 
What about other bugs? What if the router erroneously specifies a neighbor’s 
router-id as its own in a Router-LSA? Is this a problem the protocol should 
handle? 
 
[Jie] That depends on the significance to the network; case-by-case analysis 
may be needed. 
 
 
I agree that an OSPF YANG notification for LSA timeout is a nice thing to have 
and could be useful to identify the misbehaving router. My concern is that 
sometimes the network may be so severely impacted that the connectivity of 
NETCONF/RESTCONF is also impacted. To avoid this, some mechanism to mitigate 
the impact of this problem could help.
 
I believe a router having such an impact would be easy to identify… 
 
[Jie] According to the feedback from on-site engineers, when IGP routing is 
oscillating so severely that the management connection becomes unavailable, 
troubleshooting usually takes much longer, as logging in to any router 
cannot be done via the management network. So maybe it would be better to have 
some automatic mechanism to reduce the impact before it becomes a big problem 
to troubleshoot.
 
Best regards,
Jie
 
Thanks,
Acee 
 
 
Best regards,
Jie
 
From: Acee Lindem (acee) [mailto:a...@cisco.com] 
Sent: Saturday, August 13, 2016 3:27 AM
To: Les Ginsberg (ginsberg); Dongjie (Jimmy); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqi...@chinamobile.com
Subject: Re: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
 
Speaking as a WG member: 
 
Hi Jie, 
 
I believe we agree that the problem is confined to OSPF bugs, system timer 
bugs, and packet corruption. I’d assert that corruption can be detected via 
OSPF authentication. In fact, there is a well-known anecdote where IS-IS 
authentication was enabled solely for the purpose of filtering corrupted 
protocol packets in an environment with line cards that were prone to such 
corruption. Hence, we are left with problems based on OSPF or system timer 
bugs. If there were a system timer bug, I’d doubt that a networking device with 
such a bug would be functional to the point of being able to establish and 
maintain OSPF adjacencies.  Do we really want to enhance the protocol to 
deal with bugs? 
 
I’ve thought about this and one potential action I could envision would be to 
add a separate OSPF YANG notification where an LSA times out and a router other 
than the originator purges it. This way, the misbehaving OSPF router could be 
readily identified. 
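
The counter and notification idea can be sketched roughly as follows. This is only an illustration of the logic on the purging router: the class names, the notification tuple, and the router IDs are hypothetical, and no actual YANG model or vendor API is implied.

```python
from dataclasses import dataclass, field

MAX_AGE = 3600  # seconds, RFC 2328 architectural constant


@dataclass
class Lsa:
    """Hypothetical LSDB entry reduced to the fields needed here."""
    advertising_router: str
    ls_id: str
    age: int


@dataclass
class PurgeMonitor:
    """Counts purges of LSAs that the local router did not originate."""
    router_id: str
    foreign_purge_count: int = 0
    notifications: list = field(default_factory=list)

    def age_lsa(self, lsa: Lsa, elapsed: int) -> bool:
        """Advance the LSA's age; return True if it reached MaxAge and is purged."""
        lsa.age = min(lsa.age + elapsed, MAX_AGE)
        if lsa.age < MAX_AGE:
            return False
        if lsa.advertising_router != self.router_id:
            # Purging someone else's LSA should be rare, so record and report it.
            self.foreign_purge_count += 1
            self.notifications.append((lsa.advertising_router, lsa.ls_id))
        return True


monitor = PurgeMonitor(router_id="10.0.0.1")
own = Lsa(advertising_router="10.0.0.1", ls_id="10.0.0.1", age=3599)
foreign = Lsa(advertising_router="10.0.0.2", ls_id="10.0.0.2", age=3599)
purged_own = monitor.age_lsa(own, elapsed=1)
purged_foreign = monitor.age_lsa(foreign, elapsed=1)
print(monitor.foreign_purge_count, monitor.notifications)
```

A router whose counter climbs quickly while its neighbors' counters stay flat would be the natural first suspect.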
 
Thanks,
Acee 
 
 
From: OSPF <ospf-boun...@ietf.org> on behalf of "Les Ginsberg (ginsberg)" 
<ginsb...@cisco.com>
Date: Thursday, August 11, 2016 at 1:29 PM
To: Jie Dong <jie.d...@huawei.com>, OSPF WG List <ospf@ietf.org>
Cc: "Zhangxudong (zhangxudong, VRP)" <zhangxud...@huawei.com>, 
"lizhenqi...@chinamobile.com" <lizhenqi...@chinamobile.com>
Subject: Re: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
 
Jie –
 
Having the discussion has certainly been a good thing, but if the consensus of 
the WG is that there is no protocol change required then there is no need for 
any draft – which is my current position.
 
The other point is that you seem to be confusing the IS-IS Purge Originator 
Identification TLV (RFC 6232) with detecting invalid purges/remaining lifetime 
corruption. 
This is not the case. RFC 6232 simply allows us to detect which router 
originated a purge – it is not able to detect whether a purge is valid/invalid 
– and was not motivated by concerns about remaining lifetime corruption.
 
   Les
 
 
From: Dongjie (Jimmy) [mailto:jie.d...@huawei.com] 
Sent: Wednesday, August 10, 2016 9:24 PM
To: Les Ginsberg (ginsberg); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqi...@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
 
Hi Les,
 
The current draft is a problem statement, so IMO what the WG needs to 
consider is whether this is a vulnerability of the OSPF protocol, and whether it 
can have a negative impact on the network. If the problem is acknowledged, IMO 
it is worth documenting.
 
The “ROI” you mentioned is for the evaluation of the proposed solutions. I 
totally agree that for the timer bug case, recognizing and ignoring the 
received abnormal MaxAge LSAs cannot stop the misbehaving router from generating 
further MaxAge LSAs, as it is a systematic problem, which can only be fixed 
after the operator identifies that router. This is also similar to the 
systematic corruption of the IS-IS remaining lifetime.  And this is why this 
draft mentions two kinds of potential solutions: the mitigation mechanism can 
keep the network from being severely impacted by the problem, while for 
systematic problems, problem localization is needed to identify the misbehaving 
router and then solve the problem.
 
Best regards,
Jie
 
From: OSPF [mailto:ospf-boun...@ietf.org] On Behalf Of Les Ginsberg (ginsberg)
Sent: Monday, August 08, 2016 2:14 AM
To: Dongjie (Jimmy) <jie.d...@huawei.com>; ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP) <zhangxud...@huawei.com>; 
lizhenqi...@chinamobile.com
Subject: Re: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
 
Jie –
 
Thinking about the following some more:
 
<snip>
What remains is the possibility that an implementation has some bug and 
unintentionally modifies the age to something other than what it should be due 
to the actual elapsed time since LSA generation. I suppose a mechanism 
equivalent to what the IS-IS draft defined i.e. setting the age to “new” (0 in 
OSPF case) when first receiving a non-self-generated LSA could be useful to 
prevent negative impacts of such an implementation bug. Is this what you intend?
 
[Jie]: More specifically, the problem could be caused by either “setting the LS 
age field incorrectly due to implementation bug” or “system timer runs so fast 
that the LS age reaches MaxAge much earlier than other routers”. Another less 
likely case is that the LS age field is corrupted before the LSA is assembled 
into OSPF packet.
<end snip>
 
The benefits are extremely limited. If a router prematurely ages an LSA due to 
a timer bug, ignoring the received LSA age on reception isn’t going to prevent 
premature purging by the router which has the bug. So the effect of ignoring 
the received LSA age prior to reaching MAXAGE will be short lived. You are then 
left with the possibility that an implementation corrupts the LSA age BEFORE 
calculating checksum/crypto authentication – but its local timeout logic is 
unaffected. This has very limited value. Whether the WG considers this worth 
pursuing is something you need to ask. For myself, I don’t see much ROI here.
 
  Les
 
 
 
From: Dongjie (Jimmy) [mailto:jie.d...@huawei.com] 
Sent: Monday, August 01, 2016 9:43 PM
To: Les Ginsberg (ginsberg); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqi...@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
 
Hi Les, 
 
Please see my replies with [Jie2]:
 
From: Les Ginsberg (ginsberg) [mailto:ginsb...@cisco.com] 
Sent: Monday, August 01, 2016 9:57 PM
To: Dongjie (Jimmy); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqi...@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
 
Jie -
 
From: Dongjie (Jimmy) [mailto:jie.d...@huawei.com] 
Sent: Monday, August 01, 2016 1:44 AM
To: Les Ginsberg (ginsberg); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqi...@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
 
Hi Les,
 
Please see inline with [Jie]:
 
From: Les Ginsberg (ginsberg) [mailto:ginsb...@cisco.com] 
Sent: Monday, August 01, 2016 3:09 PM
To: Dongjie (Jimmy); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqi...@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
 
Jie –
 
Fully agree that IS-IS and OSPF differ in this regard.
 
https://www.ietf.org/id/draft-ietf-isis-remaining-lifetime-01.txt addresses 
problems where corruption of the remaining lifetime occurs either during 
transmission/reception or due to some DoS attack. This isn’t a concern with 
OSPF (hope you agree).
 
[Jie]: Yes, for OSPF the corruption during packet transmission can be detected.
 
What remains is the possibility that an implementation has some bug and 
unintentionally modifies the age to something other than what it should be due 
to the actual elapsed time since LSA generation. I suppose a mechanism 
equivalent to what the IS-IS draft defined i.e. setting the age to “new” (0 in 
OSPF case) when first receiving a non-self-generated LSA could be useful to 
prevent negative impacts of such an implementation bug. Is this what you intend?
 
[Jie]: More specifically, the problem could be caused by either “setting the LS 
age field incorrectly due to implementation bug” or “system timer runs so fast 
that the LS age reaches MaxAge much earlier than other routers”. Another less 
likely case is that the LS age field is corrupted before the LSA is assembled 
into OSPF packet.
 
[Jie]: Regarding the solution space, IMO we need to consider both cases: “LS 
age reaches MaxAge” and “LS age close to MaxAge”. For IS-IS, RFC 6232 and RFC 
6233 provide solutions for the detection and identification of corrupted IS-IS 
purges, while OSPF does not have similar mechanisms.
 
[Les:] It is incorrect to say that RFC 6232 makes it possible to detect a 
corrupt purge. What it does do is to provide an indication as to which IS 
initiated a purge. I don’t know how OSPF would address this issue, but for 
OSPFv2 at least any solution would likely not be backwards compatible. For this 
reason I suggest that you not try to address this issue in the same draft.
 
[Jie2]: Agreed, RFC 6232 provides the mechanism to track the misbehaving routers 
so that the operator can fix the problem; the detection can be based on the 
rules in RFC 6233 or some other anomalies. Indeed, for OSPFv2 legacy LSAs it is 
difficult to introduce a mechanism similar to RFC 6232, while it can be 
easier for the OSPFv2/v3 Extended LSAs. So it depends on how backward 
compatible the solution should be. I agree with you that the solution for 
Problem Localization in OSPF needs to be provided in a separate document.
 
Solutions to LS age corruption can be done in a backwards compatible way, but 
they MUST NOT result in discarding purges which pass authentication - doing so 
places you at risk for having inconsistent LSDBs in the network.
 
[Jie2]: Exactly. The received MaxAge LSAs cannot simply be discarded; the 
decision must be made carefully, probably based on some additional information. 
The authors have discussed some possible solutions internally, and will prepare 
some material for further open discussion.
 
As written, the draft makes claims that are at least misleading – and I believe 
actually incorrect. In Section 6 you say:
 
“The LS age field may be altered as a result of
   packet corruption, such modification cannot be detected by LSA
   checksum nor OSPF packet cryptographic authentication.”
 
This isn’t correct.
 
[Jie] Thanks for pointing this out. This sentence needs to be revised to mention 
“LSA corruption” rather than “packet corruption”.
 
What would be helpful – at least to me – is to move from a generic problem 
statement to the specific problem you want to solve and the proposed solution. 
This also requires you to more clearly state the cases where there is an actual 
vulnerability. It would be a lot easier to support the draft if this were done.
 
[Jie] Thanks for your suggestion. Yes we can update this draft with more 
specific problem statements as I mentioned above. 
 
[Jie] As for the proposed solutions, the current draft specifies the 
requirements on the potential solutions, from which we envision that different 
solutions may be needed for “Impact Mitigation” and “Problem Localization”. The 
solution for “Impact Mitigation” can be the easier one, for which we can start 
to discuss the potential solutions now, while the solution for “Problem 
Localization” may need more consideration.
 
[Les:] A discussion of the requirements is useful and necessary, but IMO until 
you propose a solution there isn’t enough substance for the document to become 
a WG document.
 
[Jie2] Yes, the current draft focuses on the problem statement and the 
requirements; the goal is first to get the MaxAge flush problem acknowledged 
and reach consensus on the requirements. Then the plan is to specify the 
solutions in separate documents.  Your valuable suggestions will be considered, 
and further contributions are welcome.
 
Best regards,
Jie
 
    Les
 
Best regards,
Jie
 
   Les
 
 
From: Dongjie (Jimmy) [mailto:jie.d...@huawei.com] 
Sent: Sunday, July 31, 2016 11:48 PM
To: Les Ginsberg (ginsberg); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqi...@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
 
Hi Les, 
 
Thanks for your comments.
 
OSPF packet-level checksum and authentication can only protect the assembled 
LSU packet for one hop on the wire; they cannot detect any change to an LSA made 
by the routers. This is because the OSPF packets are re-assembled on each hop, 
which is slightly different from IS-IS. So the problem for OSPF is mainly due 
to problems inside the router, for example protocol implementations, system 
timers, or some hardware problem. Actually, this problem has been seen in 
several production networks.
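
To illustrate the LSA-level side of this point: in RFC 2328 the LS checksum is computed over the LSA contents excluding the LS age field, so a change to the age made inside a router leaves the checksum unchanged. The sketch below uses a plain Fletcher-16 over a toy byte layout (a 2-byte age followed by an opaque body) as a stand-in; it is not the exact OSPF checksum procedure, and the helper names are hypothetical.

```python
import struct


def fletcher16(data: bytes) -> int:
    """Plain Fletcher-16, used here only as a simplified stand-in."""
    sum1 = sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


def toy_lsa(ls_age: int, body: bytes) -> bytes:
    """Toy LSA: 2-byte LS age in network byte order, then an opaque body."""
    return struct.pack("!H", ls_age) + body


def lsa_checksum(lsa: bytes) -> int:
    """Checksum everything except the first two bytes (the LS age)."""
    return fletcher16(lsa[2:])


fresh = toy_lsa(120, b"router-lsa-body")
aged = toy_lsa(3600, b"router-lsa-body")      # only the age differs
altered = toy_lsa(120, b"router-lsa-bodz")    # the body differs

print(lsa_checksum(fresh) == lsa_checksum(aged))     # age change is invisible
print(lsa_checksum(fresh) == lsa_checksum(altered))  # body change is detected
```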
 
We can improve the description in the draft to make this clear.
 
Best regards,
Jie
 
From: Les Ginsberg (ginsberg) [mailto:ginsb...@cisco.com] 
Sent: Monday, August 01, 2016 1:30 PM
To: Dongjie (Jimmy); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqi...@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
 
Jie –
 
The draft says (Section 2):
 
“Since cryptographic authentication is executed at the OSPF packet
   level, it can only protect the assembled LSU packet for one hop and
   does not provide any additional protection for the corruption of LS
   age field.”
 
But as authentication is calculated at the OSPF packet level, any change to the 
LS age field for an individual LSA contained within the OSPF packet (e.g. by 
some packet corruption in transmission) would cause authentication to fail when 
the packet is received. So the statement you make is not correct. I therefore 
am struggling to understand what problem you believe is not addressed by 
existing authentication techniques.
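
A small sketch of the distinction at stake here, using HMAC-SHA-256 over a toy packet as a stand-in for whatever OSPF cryptographic authentication is configured. The packet layout, key, and function names are hypothetical. It shows that a change to the LS age after the digest is computed fails verification, while a change made before the digest is computed does not.

```python
import hashlib
import hmac
import struct

MAX_AGE = 3600  # seconds, RFC 2328 architectural constant


def build_packet(ls_age: int, body: bytes) -> bytes:
    """Toy packet: 2-byte LS age followed by an opaque LSA body."""
    return struct.pack("!H", ls_age) + body


def sign(key: bytes, packet: bytes) -> bytes:
    """Digest over the whole packet, including the LS age bytes."""
    return hmac.new(key, packet, hashlib.sha256).digest()


def verify(key: bytes, packet: bytes, digest: bytes) -> bool:
    return hmac.compare_digest(sign(key, packet), digest)


key = b"shared-secret"  # hypothetical key
packet = build_packet(120, b"router-lsa-body")
digest = sign(key, packet)

# Case 1: the age is altered in transit, after the sender computed the digest.
corrupted = build_packet(MAX_AGE, b"router-lsa-body")
in_transit_ok = verify(key, corrupted, digest)

# Case 2: the age is altered inside the sender, before the digest is computed.
digest_over_corrupted = sign(key, corrupted)
pre_signing_ok = verify(key, corrupted, digest_over_corrupted)

print(in_transit_ok, pre_signing_ok)
```

Only the second case, where the sender authenticates an already wrong age, gets past the receiver's check.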
 
   Les
 
 
 
From: OSPF [mailto:ospf-boun...@ietf.org] On Behalf Of Dongjie (Jimmy)
Sent: Sunday, July 31, 2016 8:15 PM
To: ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqi...@chinamobile.com
Subject: [OSPF] Solicit feedbacks on 
draft-dong-ospf-maxage-flush-problem-statement
 
Hi all,
 
draft-dong-ospf-maxage-flush-problem-statement describes the problems caused by 
the corruption of the LS Age field, and summarizes the requirements on 
potential solutions. This draft received good comments during the presentation 
at the IETF meeting in B.A.
 
The authors would like to solicit further feedback from the mailing list, on 
both the problem statement and the solution requirements. Based on the 
feedback, we will update the problem statement draft, and work together to 
build suitable solutions. 
 
The URL of the draft is:
https://tools.ietf.org/html/draft-dong-ospf-maxage-flush-problem-statement-00
 
Comments and feedback are welcome.
 
Best regards,
Jie
 
_______________________________________________
OSPF mailing list
OSPF@ietf.org
https://www.ietf.org/mailman/listinfo/ospf
