Hi Les,
Please see inline.

> On Oct 6, 2025, at 9:44 AM, Les Ginsberg (ginsberg) <[email protected]> wrote:
>
> Tony –
>
> Regarding the old thread from July 19, here is an excerpt of the exchange
> between Tony Li and myself:
>
> <snip>
>
> [LES:] Let’s use a very simple example.
>
> A and B are neighbors.
> For LSPs originated by node C, here is the current state of the LSPDB:
>
> A has (C.00-00 (Seq 10), C.00-01 (Seq 8), C.00-02 (Seq 7)), Merkle hash: 0xABCD
> B has (C.00-00 (Seq 10), C.00-01 (Seq 9), C.00-02 (Seq 6)), Merkle hash: 0xABCD
> (unlikely that the hashes match – but possible)
>
> When A and B exchange hash TLVs, they will think they have the same set of
> LSPs originated by C even though they don’t.
> They would clear any SRM bits currently set to send updated LSPs received
> from C on the interface connecting A-B.
> We have just broken the reliability of the update process.
>
> [Tony Li:] By that metric, the update process has always been unreliable. All
> it takes is two LSPs with different contents and the same checksum. This
> breaks CSNPs. As Tony P. has said, we are now very much into the realm of
> stochastic processes. CSNPs work in practice because the odds of a collision
> are quite small. The HSNP approach carries that forward.
>
> The analogy of the use of the Fletcher checksum on PDU contents is not a good
> one. The checksum allows a receiver to determine whether any bit errors
> occurred in the transmission. If a bit error occurs and is undetected by the
> checksum, that is bad – but it just means that a few bits in the data are
> wrong – not that we are missing the entire LSP.
>
> <end snip>
>
> Given that we all have “grey hair”, I did not think it necessary to further
> clarify – but perhaps I should have.
> ISO 10589 Section 7.3.16 – and especially Section 7.3.16.2 – is relevant here.

If you do not reply, then we assume that you have seen the error of your ways
and subsequently agree. Silence is assent.

> If, as is suggested, we have two LSPs with the same source ID and the same
> sequence number but different checksums, the procedures defined in ISO 10589
> 7.3.16.2 will result in the LSP in question either getting purged or
> regenerated with a higher sequence number (depending on whether the LSP in
> question is not owned/owned by the system which detects the inconsistency).
> This results in proper synchronization of the LSPDB.

You have described a different situation. If you read carefully, I am
suggesting the case where the checksums are the same, not different. This can
happen because the Fletcher checksum, like any hashing function, has hash
collisions. They’re rare, but they can happen. This means that the legacy
mechanism is NOT 100% reliable, just close to it. I will happily stipulate
that it is Super Good Enough.

> My point is that in an HSNP, since you no longer have the individual LSP
> descriptions but just a summary hash – any collision means you will not
> detect the inconsistency and therefore not take any steps to properly
> synchronize the databases.

That is correct. What are the odds of that? We claim that they are extremely
low and thus hashing is still Super Good Enough. If you feel that our hashing
algorithm is insufficiently strong, we would welcome that debate.

> I see no reason why there should be any disagreement on this point.
>
> You might find the low probability of this occurring “acceptable”. I do not –
> which is my main point.

You have a “low probability” of collisions today. We are not proposing to
increase the probability of error. Therefore, we do not understand your
objection.
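For concreteness, here is a rough sketch of the kind of set-summary comparison
we have in mind, applied to your A/B example. It is illustrative only: it
assumes a truncated SHA-256 digest over sorted (LSP ID, sequence number,
checksum) tuples, which is not necessarily the exact construction in the
draft.

    # Illustrative only: summarize a set of LSP headers with one digest,
    # the way an HSNP stands in for per-LSP CSNP entries.
    import hashlib

    def lspdb_summary_hash(lsps, bits=64):
        """lsps: iterable of (lsp_id, seq_no, checksum) tuples."""
        h = hashlib.sha256()
        for lsp_id, seq_no, checksum in sorted(lsps):
            h.update(f"{lsp_id}/{seq_no:08x}/{checksum:04x}".encode())
        return h.hexdigest()[: bits // 4]  # truncated digest, for illustration

    # Your example: same C.00-00, different seq numbers for C.00-01/C.00-02.
    a = [("C.00-00", 10, 0x1111), ("C.00-01", 8, 0x2222), ("C.00-02", 7, 0x3333)]
    b = [("C.00-00", 10, 0x1111), ("C.00-01", 9, 0x4444), ("C.00-02", 6, 0x5555)]
    print(lspdb_summary_hash(a) == lspdb_summary_hash(b))
    # prints False; for a uniform 64-bit digest, a silent collision between
    # two *different* sets has probability about 2**-64 (~5e-20) per compare

A 16-bit checksum, by comparison, can do no better than roughly 2**-16 per
comparison on this metric.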
Regards,
Tony

> Les
>
> From: Tony Przygienda <[email protected]>
> Sent: Monday, October 6, 2025 4:49 AM
> To: Les Ginsberg (ginsberg) <[email protected]>
> Cc: [email protected]; lsr <[email protected]>
> Subject: [Lsr] Re: Further Comments on draft-prz-lsr-hierarchical-snps
>
> I don't see anything substantially new in this email, and you did not respond
> to the Jul 19 thread refuting your largely same arguments (especially your #1
> seems simply incorrect: a CSNP will _never_ detect a checksum collision if
> the sequence number is the same, as Tony Li pointed out, and hence the whole
> argument "it's unreliable with this but it was reliable before" is a fallacy,
> especially once you do some probability math). Hence I allow myself to ignore
> this thread. As a mildly acerbic note, an _enormous_ amount of data is being
> synced using Merkle trees in things like DynamoDB, which is holding up most
> of Amazon.
>
> As to #2 and #3, it is up to the implementation to use such strategies and
> they don't need standardization, though AFAIS a section of "further
> considerations" could include such advice, albeit I doubt its value; e.g.,
> bringing up an adjacency with a huge database (where this spec is aiming) in
> which 99% of the database is already the same (a flap) can massively benefit
> from this.
>
> thanks
>
> -- tony
>
> On Mon, Oct 6, 2025 at 6:15 AM Les Ginsberg (ginsberg) <[email protected]> wrote:
>
> > At a high level, this is the way I am viewing this draft.
> >
> > IS-IS has 100% reliable flooding defined in the base specification.
> > As scale requirements increase, there has been interest in optimizing
> > flooding. See:
> >
> > https://datatracker.ietf.org/doc/rfc9667/
> > https://datatracker.ietf.org/doc/draft-ietf-lsr-isis-flood-reduction-arch/
> >
> > Flooding optimizations introduce the possibility that the reliability of
> > flooding may be compromised. Therefore, the use of CSNPs is of interest, as
> > this can restore the 100% reliable flooding guarantee while allowing
> > flooding optimizations to be deployed. However, at scale, the size of a
> > complete set of CSNPs can become large, so there is interest in finding a
> > way to reduce the cost of using CSNPs.
> >
> > This draft is a proposal which can significantly reduce the number and size
> > of PDUs required to convey a summary of the state of the LSPDB (which is
> > what CSNPs do today). However, it does not provide a guarantee that all
> > LSPDB inconsistencies will be detected.
> >
> > I do not believe we are (or should be) in the business of defining
> > solutions which work "most of the time".
> >
> > I cannot support this proposal.
> >
> > Below are the specific issues I see in the proposed solution. But my
> > objections are fundamentally about not being able to provide 100%
> > reliability. Addressing the issues below will not alter my opinion unless
> > it also provides 100% reliability.
> >
> > Issue #1: Unreliability
> >
> > The draft proposes to use a simple hash to summarize the state of a range
> > of LSPs. The possibility of "hash collision" is not insignificant. When it
> > occurs it will be undetectable - which compromises the reliability of the
> > Update process.
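To see why a same-sequence-number checksum collision defeats a CSNP, it helps
to look at Fletcher itself. Below is a minimal Fletcher-16 sketch (simplified
from the ISO form, which additionally selects two check bytes so that a
received PDU sums to zero); because both running sums are taken mod 255, the
byte values 0x00 and 0xFF contribute identically, giving a deterministic
collision:

    def fletcher16(data: bytes) -> int:
        # Two running sums, each taken mod 255, as in the Fletcher family.
        sum1 = sum2 = 0
        for byte in data:
            sum1 = (sum1 + byte) % 255
            sum2 = (sum2 + sum1) % 255
        return (sum2 << 8) | sum1

    # 0x00 and 0xFF are congruent mod 255, so these differing payloads carry
    # the same checksum -- a CSNP comparing (LSP ID, seq, checksum) entries
    # cannot tell them apart when the sequence numbers also match.
    assert fletcher16(b"lsp-\x00-body") == fletcher16(b"lsp-\xff-body")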
> > It has been mentioned that even the existing PDU checksum mechanism used
> > by IS-IS (Fletcher) can produce collisions - which is true. But in such a
> > case, the raw data is still present in the PDU and can be used to detect
> > LSPDB inconsistencies even in the presence of a checksum collision. In the
> > HSNP proposal, because only a summary of the data is present, it is not
> > possible to detect or recover from a hash collision.
> >
> > Issue #2: Solution becomes less useful in the presence of LSPDB differences
> >
> > The choice of system ID ranges to advertise in the HSNP is optimized for
> > cases where the neighbors' LSPDBs are mostly in sync. In the case of an
> > established adjacency, this is likely to be true. But in the case of
> > adjacency bringup, this is less likely.
> >
> > If one neighbor has LSPs from nodes A, B, C and the other neighbor has not
> > yet received any LSPs from B, then the choice of a system ID range greater
> > than 1 is likely to trigger a hash mismatch and result in either
> > unnecessary flooding of LSPs from all nodes or reversion to traditional
> > CSNPs.
> >
> > This makes the solution unusable in the case of adjacency bringup - which
> > is a case also worthy of optimization. A good solution to this issue should
> > be usable both for adjacency bringup and for periodic CSNPs.
> >
> > Issue #3: The solution degrades as scale (size of the LSPDB) increases
> >
> > When the LSPDBs are mismatched, the likelihood of hash mismatches
> > increases. Even in a stable network, there is a base level of LSP refresh
> > flooding that occurs. Assuming an LSP lifetime of 65535 seconds and an LSP
> > refresh time of 65000 seconds, we can expect a base level of LSP updates
> > as shown below:
> >
> > Size of LSPDB    Average LSP flooding rate
> > -------------------------------------------
> >  1000            1 LSP/65 seconds
> > 10000            1 LSP/6.5 seconds
> > 20000            1 LSP/3.25 seconds
> > ...
> >
> > This means that as scale increases, the likelihood of hash mismatches
> > increases. Even in the absence of any LSP flooding pathology, this is
> > likely to trigger redundant LSP flooding or a reversion to traditional
> > SNPs.
> >
> > To overcome this, one could imagine a strategy that suppresses HSNPs when
> > SRM bits are currently set on an interface - but as one of the primary use
> > cases for HSNPs is in the presence of flooding optimizations, where
> > flooding is intentionally suppressed on some interfaces, that strategy will
> > not be applicable in such cases.
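The rates in the table above are simply the refresh interval divided by the
database size; a short sketch to reproduce them (assuming the 65000-second
refresh time used in the example):

    # With every LSP refreshed once per REFRESH_TIME seconds, a database of
    # n LSPs refreshes one LSP every REFRESH_TIME / n seconds on average.
    REFRESH_TIME = 65000  # seconds, per the assumption above

    for n in (1000, 10000, 20000):
        print(f"{n:5d} LSPs -> 1 LSP every {REFRESH_TIME / n:.2f} seconds")
    # -> 65.00, 6.50, and 3.25 seconds respectively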
> > Les

_______________________________________________
Lsr mailing list -- [email protected]
To unsubscribe send an email to [email protected]
