So I ran Fletcher instead of the double hashing I was toying around with.
Fletcher shows about the same distribution under low-entropy scenarios
(i.e. checksums varying by very few bits, which is not very likely, but
then again) and a more even and steeper distribution under normal-entropy
scenarios (i.e. fragment checksums relatively uniformly distributed), so
AFAIS we should cut over to Fletcher, as Tony Li suggests, to generate
single-fragment hashes as the better alternative. Graphs below.
Double Hashing
13.38 ┤ ╭╮ ╭╮ ╭╮
12.35 ┤ ╭╮ ││ ││ ││ ╭╮
11.32 ┤ ╭╮ ││ ││ ││ ││ ││ ╭╮
10.29 ┤ ╭╮ ││ ││ ││ ││ ││ ││ ││
9.26 ┤ ││ ││ ││ ││ │╰╮ │╰╮ │╰╮ ││ ╭╮
8.23 ┤ ╭╮ ││ ││ ││ │╰╮ │ │ │ │ │ │ │╰╮ ││
7.20 ┤ ││ ││ ││ │╰╮ │ │ │ │ │ │ │ │ │ │ │╰╮ ╭╮
6.17 ┤ ││ │╰╮ │╰╮╭╯ │╭╯ │╭╯ │╭╯ │ │ │ │ │ │ │ │╰╮
5.15 ┤ ││ ╭╯ │╭╯ ││ ││ ││ ││ │╭╯ │╭╯ │ │ │ │ │ ╭╮
4.12 ┤ ╭╯╰╮│ ││ ││ ││ ││ ││ ││ ││ │╭╯ │ │ │ │╰╮
3.09 ┤ │ ││ ││ ││ ││ ││ ││ ││ ││ ││ │╭╯ │ │ │
2.06 ┤ │ ││ ││ ││ ││ ││ ││ ╰╯ ╰╯ ╰╯ ╰╯ ╰─╯ │ ╭─╮
1.03 ┤ │ ││ ││ ╰╯ ╰╯ ╰╯ ╰╯ ╰─╯ ╰╮
0.00 ┼──╯ ╰╯ ╰╯
╰────────────────────────────────────────────────────────────────────────
Log2 Hamming Distribution of Hashes built
over low entropy mean 13.115629835964098
13.46 ┤ ╭╮ ╭╮ ╭╮
12.42 ┤ ╭╮ ││ ││ ││ ╭╮
11.39 ┤ ││ ││ ││ ││ ││
10.35 ┤ ╭╮ ││ ││ ││ ││ ││ ╭╮
9.32 ┤ ││ ││ ││ │╰╮ │╰╮ │╰╮ ││
8.28 ┤ ╭╮ ││ ││ │╰╮ │ │ │ │ │ │ │╰╮ ╭╮
7.25 ┤ ││ ││ │╰╮ │ │ │ │ │ │ │ │ │ │ │╰╮
6.21 ┤ ││ │╰╮ │ │╭╯ │╭╯ │╭╯ │ │ │ │ │ │ │ ╭╮
5.18 ┤ ╭╮ │╰╮╭╯ │╭╯ ││ ││ ││ │╭╯ │ │ │ │ │ │╰╮
4.14 ┤ ││ ╭╯ ││ ││ ││ ││ ││ ││ │╭╯ │ │ │ │ │
3.11 ┤ ╭╯╰╮│ ││ ││ ││ ││ ││ ││ ││ │╭╯ │ │ │ ╭╮
2.07 ┤ │ ││ ││ ││ ││ ││ ╰╯ ╰╯ ╰╯ ╰╯ ╰─╯ │ │╰╮
1.04 ┤ │ ││ ││ ╰╯ ╰╯ ╰╯ ╰─╯ ╰╮
0.00 ┼──────╯ ╰╯ ╰╯
╰────────────────────────────────────────────────────────────────────────
Log2 Hamming Distribution of Hashes built over
realistic checksums mean 14.770937790157847
versus fletcher hamming distributions
13.38 ┤ ╭╮ ╭╮ ╭╮
12.35 ┤ ╭╮ ││ ││ ││ ╭╮
11.32 ┤ ╭╮ ││ ││ ││ ││ ││ ╭╮
10.29 ┤ ╭╮ ││ ││ ││ ││ ││ ││ ││
9.26 ┤ ││ ││ ││ ││ │╰╮ │╰╮ │╰╮ ││ ╭╮
8.23 ┤ ╭╮ ││ ││ ││ │╰╮ │ │ │ │ │ │ │╰╮ ││
7.20 ┤ ││ ││ ││ │╰╮ │ │ │ │ │ │ │ │ │ │ │╰╮ ╭╮
6.17 ┤ ││ │╰╮ │╰╮╭╯ │╭╯ │╭╯ │╭╯ │ │ │ │ │ │ │ │╰╮
5.15 ┤ ││ ╭╯ │╭╯ ││ ││ ││ ││ │╭╯ │╭╯ │ │ │ │ │ ╭╮
4.12 ┤ ╭╯╰╮│ ││ ││ ││ ││ ││ ││ ││ │╭╯ │ │ │ │╰╮
3.09 ┤ │ ││ ││ ││ ││ ││ ││ ││ ││ ││ │╭╯ │ │ │
2.06 ┤ │ ││ ││ ││ ││ ││ ││ ╰╯ ╰╯ ╰╯ ╰╯ ╰─╯ │ ╭─╮
1.03 ┤ │ ││ ││ ╰╯ ╰╯ ╰╯ ╰╯ ╰─╯ ╰╮
0.00 ┼──╯ ╰╯ ╰╯
╰────────────────────────────────────────────────────────────────────────
Log2 Hamming Distribution of Hashes built
over low entropy mean 13.115629835964098
14.10 ┤ ╭─────────╮
13.10 ┤ ╭───╯ ╰──╮
12.09 ┤ ╭─╯ ╰──╮
11.08 ┤ ╭─╯ ╰─╮
10.07 ┤ ╭─╯ ╰╮
9.07 ┤ ╭╯ ╰╮
8.06 ┤ ╭─╯ ╰─╮
7.05 ┤ ╭─╯ ╰╮
6.04 ┤ ╭╯ ╰╮
5.04 ┤ ╭╯ ╰╮
4.03 ┤ ╭╯ ╰╮
3.02 ┤ ╭╯ ╰╮
2.01 ┤ ╭╯ ╰╮
1.01 ┤ ╭╯ ╰╮
-0.00 ┼──────╯
╰──────────────────────────────────────────────────────────────────────
Log2 Hamming Distribution of Hashes built over
realistic checksums mean 15.987403280718043
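For reproducibility, here is a minimal sketch of the kind of measurement behind these graphs: a textbook Fletcher-32 over 16-bit words plus the Hamming distance between hashes. The names and the exact word packing are illustrative assumptions, not necessarily the draft's actual code.

```rust
// Textbook Fletcher-32 over 16-bit words (illustrative, may differ from
// the draft's exact variant), plus the Hamming-distance measurement
// used to build the distributions above.
fn fletcher32(words: &[u16]) -> u32 {
    let (mut c0, mut c1) = (0u32, 0u32);
    for &w in words {
        c0 = (c0 + w as u32) % 65535;
        c1 = (c1 + c0) % 65535;
    }
    (c1 << 16) | c0
}

// Hamming distance between two 32-bit hashes: popcount of the XOR.
fn hamming(a: u32, b: u32) -> u32 {
    (a ^ b).count_ones()
}

fn main() {
    // Low-entropy scenario: two inputs that differ by a single bit.
    let base: Vec<u16> = (0..32).collect();
    let mut flipped = base.clone();
    flipped[0] ^= 1; // flip one input bit
    let d = hamming(fletcher32(&base), fletcher32(&flipped));
    println!("hamming distance for 1-bit input change: {d}");
}
```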
Ultimately I ran performance measurements on good Rust implementations of
both over millions of generated fragments. Fletcher is on average as fast
as good double hashing, albeit with a few more CPU outliers, so in terms
of CPU load it doesn't matter which one we use AFAIS.
If we still want to toy with the idea of _per neighbor_ or _over time
varying_ hashes, then AFAIS it boils down to varying the initial constants
of Fletcher. Not a fan of it, since CPU load goes up with the number of
neighbors, and it also implies either per-neighbor caching or otherwise
permanent recomputation of things ...
For the really curious: I played with Adler as well; no advantage over
simple Fletcher.
-- tony
On Thu, Oct 9, 2025 at 5:23 PM Tony Przygienda <[email protected]> wrote:
> yeah, I read you and it's an intriguing and helpful idea.
>
> Simplest form of that would be "roll every 3 times on HSNPs to a new
> random seed" which would be advertised in the HSNP header itself.
>
> viable, heavy on computation and caching
>
> like I said, a 24-bit hash could be a cheaper CPU solution at the cost
> of smaller coverage due to packet size
>
>
> On Thu, Oct 9, 2025 at 3:28 PM <[email protected]> wrote:
>
>> Tony,
>>
>>
>>
>> Thanks for the reply and for elaborating.
>>
>>
>>
>> 1. By “per neighbor” I meant per adjacency.
>>
>> I had thought about just using a different seed, but I could not
>> convince myself that this would help with collisions. XOR being
>> commutative and associative, it looks like one could equally apply the
>> seed at the end (last XOR), hence after the collision. That's why I had
>> proposed something a bit heavier with a slightly different hash
>> function. But I would not pretend to be capable of following you.
>>
>>
>>
>> 2. Thanks for doing the math (for me), as the actual number
>> (probability) does matter.
>>
>> Not being familiar with Rust, I’ll wait for the next revision of the
>> draft.
>>
>>
>>
>> That being said, beyond XOR, I'm not familiar with hash functions and
>> their properties.
>>
>> My point was limited to saying that maybe using spatial diversity (my
>> different adjacencies) could help with collisions (if we can manage to
>> have different collisions depending on the adjacency).
>>
>>
>>
>> --Bruno
>>
>>
>>
>> *From:* Tony Przygienda <[email protected]>
>> *Sent:* Wednesday, October 8, 2025 9:31 PM
>> *To:* DECRAENE Bruno INNOV/NET <[email protected]>
>> *Cc:* lsr <[email protected]>
>> *Subject:* Re: [Lsr] draft-prz-lsr-hierarchical-snps
>>
>>
>>
>> Bruno, thanks.
>>
>>
>>
>> Multiple things
>>
>>
>>
>> 1. I am not sure what "per neighbor" here means. We could take some
>> random number, advertise it in the hellos and use that as seed; all
>> good, except hellos are unreliable. Also, things like "take the
>> neighbor's address" are difficult to do due to unnumbered links and so
>> on. Ultimately such a seed, unless it varies over time reliably, is
>> just another constant in the hashing and I doubt it will improve any
>> probabilities. But yes, within a system we _could_ vary some constant
>> used to build the hashes regularly (kind of like a nonce) and advertise
>> this constant in the HSNP itself. Computationally expensive for the
>> neighbor to compare, and even if the numbers change slowly we need to
>> cache the whole thing per neighbor (since each neighbor uses a
>> different seed), but yes, it would vastly lower the probability of a
>> "hash clash" of course.
>>
>> 2. To give the math quickly once again. I started with 64 bits (and we
>> could do it) but things were big in terms of packet sizes and I
>> realized with math it's overkill AFAIS. Today's fletcher (AFAIR ISIS
>> uses a variant of that) is 16 bits, so assuming uniform distribution
>> and independent variables we could say that a HSNP checksum over 4E9
>> (2^32) fragments statistically has the same "strength" as fletcher over
>> a single fragment. Generally in a 1st level HSNP we'll cover about 80
>> ranges, each over max. 60 fragments or so (if I remember the numbers in
>> the draft correctly), so the probability is roughly 48E2 / 4E9. At 2nd
>> level each range can pack a full packet of L1 ranges, so we can say
>> 80 * 80 * 60 fragments, which is 38E4 / 4E9. So in plain numbers we are
>> talking, in very coarse terms, of a 0.000001 and 0.0001 chance of
>> things clashing. Once we move to layer 3 we are in the realm of a 30E6
>> fragment database, not something I consider likely, but for the sake of
>> argument that's 3E7 / 4E9 and then we're in 0.01 magnitude, i.e. there
>> is a chance of very roughly 1% that two databases containing 30 million
>> fragments end up producing somewhere the same hash for two fragments.
>> Of course it matters a _lot_ where this same hash is produced. If it's
>> in the same L1 range then XOR'ing both of them will basically remove
>> both fragments. If not in the same range it's basically meaningless,
>> since the range above it will not see the clash anymore. It all gets
>> mathematically hairy to analyze and I leave it at that. So if with all
>> this people still think it all necessitates solution 1 with a new
>> random number every couple of iterations, that's doable; we can also go
>> to 24-bit HSNP hashes easily.
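The coarse odds quoted can be reproduced as follows. This is my arithmetic over the fragment counts given in the text; `coarse_odds` is an illustrative name, and the birthday-style bound shown alongside is a stricter estimate that the single-number figures above do not use.

```rust
// Coarse collision odds for n fragments hashed into a 32-bit space.
// n / 2^32 matches the per-fragment figure used in the text; the
// birthday-style n*(n-1)/2 / 2^32 is the chance that ANY two of the
// n fragments clash.
fn coarse_odds(n: f64) -> (f64, f64) {
    let space = 2f64.powi(32);
    (n / space, n * (n - 1.0) / 2.0 / space)
}

fn main() {
    for (label, n) in [
        ("L1 HSNP (80 * 60)", 80.0 * 60.0),
        ("L2 HSNP (80 * 80 * 60)", 80.0 * 80.0 * 60.0),
        ("L3 database (30E6)", 3.0e7),
    ] {
        let (per_fragment, any_pair) = coarse_odds(n);
        println!("{label}: n/2^32 = {per_fragment:.1e}, birthday = {any_pair:.1e}");
    }
}
```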
>>
>>
>>
>> For the not so faint of heart, here's the fast double hashing of a
>> fragment that seems to work very well, producing good hamming distances
>> over realistic scenarios, and should make it into the next version of
>> the draft. I skip lots of definitions and just show the core Rust
>> snippet. Anyone with a good foundation in statistics/hashing theory is
>> more than welcome to criticize. BTW, we are not bound to self-inverse
>> hashing for building the hash of a fragment. The self-inversion (i.e.
>> XOR) is only of interest later, to allow for quick adjustment of caches
>> over very similar sets.
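The cache-adjustment property mentioned here fits in a few lines: if the hash of a set is the XOR of the member hashes, replacing one fragment is two XORs instead of a full recompute. A sketch with hypothetical names:

```rust
// Set hash as the XOR of member hashes. XOR is self-inverse, so the
// cached hash of a range can be adjusted in O(1) when one fragment
// changes: XOR the old fragment hash out and the new one in.
fn set_hash(members: &[u64]) -> u64 {
    members.iter().fold(0, |acc, &h| acc ^ h)
}

fn main() {
    let mut cached = set_hash(&[10, 20, 30]);
    // Fragment with hash 20 is replaced by one with hash 25:
    cached ^= 20; // remove the old hash
    cached ^= 25; // add the new hash
    assert_eq!(cached, set_hash(&[10, 25, 30]));
    println!("incremental update matches full recompute");
}
```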
>>
>>
>>
>> The hamming distributions produce weighted averages of about 14 bits
>> difference for the very low entropy checksums (i.e. checksums vary by
>> only 1 or 2 bits across the tested fragments) and about 17 of the 32
>> bits with the checksum being a uniform random number generated for each
>> fragment. So it looks like we have enough entropy in play here.
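The measurement behind those weighted averages can be sketched as below. This is my reconstruction, not the actual harness; `mean_hamming` is a hypothetical name.

```rust
// Mean Hamming distance between successive hashes: popcount of the XOR
// of each consecutive pair. For uniformly random 32-bit hashes the
// expectation is 16 bits, so ~17 observed on realistic checksums is in
// the right ballpark.
fn mean_hamming(hashes: &[u32]) -> f64 {
    let total: u32 = hashes
        .windows(2)
        .map(|w| (w[0] ^ w[1]).count_ones())
        .sum();
    total as f64 / (hashes.len() - 1) as f64
}

fn main() {
    let hs = [0x0000_0000u32, 0xFFFF_FFFF, 0x0F0F_0F0F];
    // Pair distances are 32 and 16 bits, so the mean is 24.
    println!("mean hamming = {}", mean_hamming(&hs));
}
```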
>>
>>
>>
>> impl Into<HSNPHash> for (&SharedFragmentID, &FragmentContent) {
>>     fn into(self) -> HSNPHash {
>>         // 16-bit primes to fill in the 64 bits
>>         const SEED1: u64 = 58417;
>>         const SEED2: u64 = 47527;
>>         const BYTEINBITS: usize = size_of::<u8>() * 8;
>>         let (fragmentid, fragmentcontent) = self;
>>
>>         let nodeid_iterator = &fragmentid.node.node_id().0;
>>
>>         let foldednodeid1 = nodeid_iterator
>>             .iter()
>>             .fold(SEED1, |p, n| p.rotate_left(BYTEINBITS as _) ^ *n as u64);
>>         let foldednodeid2 = nodeid_iterator
>>             .iter()
>>             .rev()
>>             .fold(SEED2, |p, n| p.rotate_left(BYTEINBITS as _) ^ *n as u64);
>>         // elements to rotate in, with their byte size
>>         let rotate_in = [
>>             (
>>                 fragmentcontent.checksum.0 as _,
>>                 size_of_val(&fragmentcontent.checksum.0),
>>             ),
>>             (fragmentid.pnode.0 as u64, size_of_val(&fragmentid.pnode.0)),
>>             (
>>                 fragmentid.fragmentnr.0 as _,
>>                 size_of_val(&fragmentid.fragmentnr.0),
>>             ),
>>             (
>>                 fragmentcontent.seqnr.0 as _,
>>                 size_of_val(&fragmentcontent.seqnr.0),
>>             ),
>>         ];
>>
>>         let extendedhash1 = rotate_in
>>             .into_iter()
>>             .fold(foldednodeid1, |p, (v, s)| {
>>                 p.rotate_left((s * BYTEINBITS) as _) ^ v
>>             });
>>         let extendedhash2 = rotate_in
>>             .into_iter()
>>             .rev()
>>             .fold(foldednodeid2, |p, (v, s)| {
>>                 p.rotate_left((s * BYTEINBITS) as _) ^ v
>>             });
>>
>>         (extendedhash1 ^ extendedhash2).into()
>>     }
>> }
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Oct 8, 2025 at 4:02 PM <[email protected]> wrote:
>>
>> Hi all,
>>
>> [Thinking out loud, so I'm pretty sure I'll regret sending this email ...]
>>
>> With regards to hash collisions, I'm wondering whether we could benefit
>> from using a different hash (method/function) per neighbor, and/or over
>> time. (A priori it's enough for me to detect my LSDB inconsistency with
>> a single neighbor.)
>> This requires more computation, but assuming that a significant
>> contribution to collisions is the final step (from 64 bits to 32 bits),
>> rotating a number of bits in the 32-bit int on a per-neighbor basis
>> seems relatively low cost.
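Bruno's rotation idea, as I read it, would vary the final 64-to-32-bit fold per neighbor. A sketch, where the function name and the exact fold shape are my assumptions, not the draft's:

```rust
// Per-neighbor 64 -> 32 bit fold: rotate the high half by a per-neighbor
// amount before XOR'ing it into the low half. Two 64-bit hashes that
// collide under one fold need not collide under another.
fn fold_per_neighbor(h64: u64, neighbor_rot: u32) -> u32 {
    let hi = (h64 >> 32) as u32;
    let lo = h64 as u32;
    lo ^ hi.rotate_left(neighbor_rot % 32)
}

fn main() {
    // These two 64-bit values collide under the plain fold (rot = 0) ...
    let a = 0x0000_0001_0000_0001u64; // hi = 1, lo = 1 -> folds to 0
    let b = 0x0000_0002_0000_0002u64; // hi = 2, lo = 2 -> folds to 0
    assert_eq!(fold_per_neighbor(a, 0), fold_per_neighbor(b, 0));
    // ... but not under a neighbor that rotates by 1.
    assert_ne!(fold_per_neighbor(a, 1), fold_per_neighbor(b, 1));
    println!("per-neighbor fold separates the collision");
}
```

Note that this only helps with collisions introduced by the folding step itself; two fragments whose full 64-bit hashes already collide stay colliding for every neighbor.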
>>
>> Regards,
>> --Bruno
>>
>> ____________________________________________________________________________________________________________
>> Ce message et ses pieces jointes peuvent contenir des informations
>> confidentielles ou privilegiees et ne doivent donc
>> pas etre diffuses, exploites ou copies sans autorisation. Si vous avez
>> recu ce message par erreur, veuillez le signaler
>> a l'expediteur et le detruire ainsi que les pieces jointes. Les messages
>> electroniques etant susceptibles d'alteration,
>> Orange decline toute responsabilite si ce message a ete altere, deforme
>> ou falsifie. Merci.
>>
>> This message and its attachments may contain confidential or privileged
>> information that may be protected by law;
>> they should not be distributed, used or copied without authorisation.
>> If you have received this email in error, please notify the sender and
>> delete this message and its attachments.
>> As emails may be altered, Orange is not liable for messages that have
>> been modified, changed or falsified.
>> Thank you.
>>
>> _______________________________________________
>> Lsr mailing list -- [email protected]
>> To unsubscribe send an email to [email protected]
>>
>>
>>