Draft authors -

While I continue to have reservations as to whether this is a good solution for 
addressing scale issues with CSNPs, I appreciate that this is a serious effort 
on your part and therefore want to provide feedback.

I also appreciate the additional content and clarifications provided in V1, 
and I recognize that the document is still at a relatively early stage. I 
expect some of the points below are planned to be addressed in subsequent 
revisions.

But I preface my remarks by saying that the document needs to be more precise 
in specifying the new PDUs and new TLVs you wish to define. In its current 
form, these elements are only "described" - not "specified". In some cases, 
which I comment on below, this leads to uncertainty/confusion. I hope future 
revisions will be much more pedantic in this regard.

Also, if the goal is to use HSNPs to achieve the same level of reliability that 
is achieved today using CSNPs, more detailed behavioral specification is 
required. Actions regarding sending/acking LSPs related to the 
sending/receiving of CSNPs are fully specified in ISO 10589. HSNPs introduce 
new behaviors - but the end goal is the same - to ensure that LSPDB 
synchronization is maintained. I think a more precise definition of how an 
implementation tracks the state of the portion of the LSPDB associated with an 
HSNP hash mismatch is required to guarantee reliability and interoperability. I 
am not suggesting that the solution you define cannot work - just that it needs 
a more precise behavioral description. Hopefully, that is coming in future 
revisions.

Section 2:

You say:

"At the lowest compression level, it is optimal to generate a single CSNP 
packet on a mismatch in a hash. To achieve this, the first-level hashes should 
initially group about 80 LSP fragments together, with exceptions handled later. 
There is no need to maximize this initial packing."

and

"The packing process always places all fragments belonging to the same system 
and its pseudonodes within a single node Merkle hash. This hash may 
occasionally exceed the recommended size of 80 fragments..."

This is confusing.
I think what you mean to say here is that it is not helpful to pack beyond 
the number of hashes that will fit in a single HSNP PDU (approximately 80 for 
a 1500-byte MTU). But if a given node is originating 200 LSPs, there is no 
way to split the hash calculation for that node across two HSNP TLVs - and 
so it may indeed require more than one CSNP to determine which of the 200 
LSPs is "out of sync" in the event of a hash mismatch.
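To make the point concrete, here is a minimal sketch of such a packing pass. 
The algorithm, function names, and fragment counts are all illustrative 
assumptions on my part - the draft does not specify this procedure - but it 
shows how the per-system grouping rule forces an oversized bucket for any 
node originating more than ~80 fragments:

```python
# Sketch only: group LSP fragments per originating system into Merkle-hash
# buckets of at most TARGET fragments. Names and counts are illustrative;
# the draft does not specify this exact algorithm.
TARGET = 80  # roughly the hashes that fit one HSNP at a 1500-byte MTU


def pack(frag_counts):
    """frag_counts: {system_id: fragment count (incl. pseudonodes)}.

    Returns a list of (bucket_size, [system_ids]) tuples."""
    buckets = []
    current, size = [], 0
    for sys_id, n in sorted(frag_counts.items()):
        # All fragments of one system must share a single node hash, so a
        # system with n > TARGET necessarily yields an oversized bucket.
        if current and size + n > TARGET:
            buckets.append((size, current))
            current, size = [], 0
        current.append(sys_id)
        size += n
    if current:
        buckets.append((size, current))
    return buckets


packed = pack({"A": 30, "B": 40, "C": 200, "D": 5})
# System "C" alone produces a 200-fragment bucket - on a mismatch of that
# hash, more than one CSNP is needed to walk its LSPs.
```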

Section 3

I am not sure why you went to a 48-bit Fletcher checksum.
I don't object - but it raises the bar to deployment/interoperability 
slightly, since implementations cannot simply reuse the Fletcher calculation 
they have been using for decades. Could you provide a clearer justification?
I appreciate that you have provided sufficient info for implementations to 
validate that they have implemented the modified Fletcher checksum correctly.
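For readers unfamiliar with how Fletcher generalizes to wider outputs, here 
is one plausible construction - 24-bit words with modulus 2^24 - 1, by 
analogy with the way Fletcher-32/-64 widen Fletcher-16. I stress this is 
purely illustrative and is not the draft's definition; the draft's own 
specification and test vectors are authoritative:

```python
# Illustrative only: one plausible Fletcher-style 48-bit construction,
# using 24-bit words and modulus 2**24 - 1. This is NOT the draft's
# definition - just a sketch of the general shape of such a checksum.
MOD = 2**24 - 1


def fletcher48(data: bytes) -> int:
    # Zero-pad so the input splits evenly into 24-bit words.
    if len(data) % 3:
        data += b"\x00" * (3 - len(data) % 3)
    s1 = s2 = 0
    for i in range(0, len(data), 3):
        word = int.from_bytes(data[i:i + 3], "big")
        s1 = (s1 + word) % MOD   # running sum of words
        s2 = (s2 + s1) % MOD     # running sum of sums (position-sensitive)
    return (s2 << 24) | s1       # 48-bit result
```

Like classic Fletcher, the second accumulator makes the result sensitive to 
word order, not just content.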

Section 5.1

You have yet to define the new TLV you require in hellos.

Section 5.2

It seems the intent is to interleave CSNPs and HSNPs (though this is not 
insisted upon), but the actions to take upon detecting a hash mismatch are 
not fully specified.
Ultimately, we have to guarantee synchronization of the LSPDB - which means 
the setting/clearing of SRM/SSN flags and related behaviors in response to 
HSNP reception needs to be specified.

Section 6

Is the header of an HSNP intended to be identical to the header of a CSNP?
I ask because the following fields in the CSNP PDU header are of length "ID 
Length +2":

Start LSP ID
End LSP ID

but since the new TLV you define uses range identifiers which are simply System 
IDs (NOT LSP IDs), it is not possible to send an HSNP which covers only some of 
the LSPs generated by a given node. This suggests that you could modify the 
Start/End LSP ID fields in the HSNP PDU header to match what you have in the 
new TLV.
If you don't do that, then you will need to state that HSNPs which have 
Start/End LSP IDs which are not of the form "A.00-00" and "B.FF-FF" 
respectively are invalid.

Figure 2 and Figure 3 seem to hint at this - but it isn't explicit.
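To illustrate the validity rule I am suggesting: assuming the HSNP header 
keeps the CSNP-style "ID Length + 2"-octet Start/End LSP ID fields and a 
6-octet System ID (so LSP ID = System ID + pseudonode octet + LSP number 
octet), the check would look something like this - field layout and names 
are my assumptions, not the draft's:

```python
# Sketch of the suggested validity check, assuming the HSNP header keeps
# the CSNP "ID Length + 2"-octet Start/End LSP ID fields with a 6-octet
# System ID. Because the hash TLV ranges identify whole systems, a valid
# HSNP must span "A.00-00" through "B.FF-FF".
def hsnp_range_valid(start_lsp_id: bytes, end_lsp_id: bytes) -> bool:
    return (len(start_lsp_id) == len(end_lsp_id) == 8
            and start_lsp_id[6:] == b"\x00\x00"   # pseudonode/frag = 00-00
            and end_lsp_id[6:] == b"\xff\xff")    # pseudonode/frag = FF-FF


# A.00-00 .. B.FF-FF is valid; anything mid-system is not.
assert hsnp_range_valid(b"\x11" * 6 + b"\x00\x00", b"\x22" * 6 + b"\xff\xff")
assert not hsnp_range_valid(b"\x11" * 6 + b"\x00\x01", b"\x22" * 6 + b"\xff\xff")
```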

Also, I assume you will be defining Level 1 and Level 2 HSNP PDUs?

You say:

"The Start and End System IDs exclude pseudonode bytes, as those are implicitly 
included within the ranges."

I think what you mean to say is:

"The Start and End Range IDs exclude pseudonode and LSP number octets, as those 
are implicitly included within the ranges."

Section 8

You say:

"thus we focus on realistic scenarios in the order of 50,000 nodes and 1 
million fragments."

Assuming use of the maximum LSP lifetime (65535 seconds) and a commonly used 
LSP refresh time of 65000 seconds, the expected number of LSPs being refreshed 
at that scale is about 15/second. Any of these LSPs may be transiently out of 
sync not because of a flooding issue but simply because LSP flooding for those 
LSPs is "in progress" at the time the HSNP is generated/transmitted/received. 
There may also be additional LSP updates triggered by topology changes which 
are in the process of being synchronized. This leads to a significant 
probability of transient/temporary hash mismatches which actually require no 
handling - but of course it is difficult at best to determine whether a hash 
mismatch is transient or persistent.
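The arithmetic behind the ~15/second figure, plus a rough sense of how many 
LSPs may be "in flight" at any instant (the 2-second network-wide flooding 
time below is purely an illustrative assumption, not a number from the 
draft):

```python
# Quick check of the refresh-rate arithmetic above.
fragments = 1_000_000        # LSPDB size from the draft's scenario
refresh_interval = 65_000    # seconds - a commonly used LSP refresh timer
rate = fragments / refresh_interval   # ~15.4 LSP refreshes per second

# Any LSP refreshed within roughly one flooding delay of HSNP generation
# may look "out of sync" purely transiently. With an (illustrative)
# 2-second network-wide flooding time:
flood_time = 2
in_flight = rate * flood_time        # ~31 LSPs in flight at any instant
```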

When a hash mismatch occurs, there are three actions available:

1) Generate an additional HSNP covering the original range where the 
mismatch was detected, but this time with greater granularity
2) Generate CSNP(s) for the LSPs in the range where the mismatch was 
detected
3) Mark all the LSPs in the original range to be flooded

It would be good to have an analysis of the impact of such transient mismatches 
on the overall efficiency of the HSNP solution.
Intuitively, the frequency of transient hash mismatches seems likely to 
increase as the size of the LSPDB increases.


Section 9.2

You spend several paragraphs discussing the case of:

"if a new fragment has the same sequence number and different content but an 
identical 16-bit Fletcher checksum" to an older LSP which exists in the 
LSPDB of nodes in the network.

We have discussed this at length previously - and we all agree that this is an 
existing vulnerability in the protocol - though the probability of its 
occurrence (as you have calculated) is extremely low and even then, confined to 
time windows shortly after a node has restarted.

This is a vulnerability associated with LSP generation.
It is not introduced by CSNPs - nor by HSNPs.
It is not detected by CSNPs - nor by HSNPs.
It is not correctable by CSNPs - nor by HSNPs.
And you are not proposing a means of resolving this vulnerability in the draft.

So I wonder why this discussion is included in the draft?

***

Finally, I mention a suggestion that I may have made previously.

Rather than define a new PDU, you could simply introduce a new TLV into 
existing CSNPs. This might have advantages when you detect an HSNP hash 
mismatch and are taking steps to isolate the impacted LSPs. Rather than 
sending HSNPs and CSNPs, you could send CSNPs with a mixture of TLVs - which 
might reduce the total number of PDUs sent to resolve the hash mismatches.

Thanx very much for your consideration of these comments.

    Les




_______________________________________________
Lsr mailing list -- [email protected]
To unsubscribe send an email to [email protected]
