Dear Networking experts,
I have been fighting for several months with the fact that invitations often seem not to work when running on a serverless mesh. The symptoms are quite strange. If an invitation works once between two laptops, it continues to work between them reliably. If it fails once, it continues to fail between them consistently. Sometimes, in the same place, invitations will work on one mesh channel and not on another. The same two XOs may succeed reliably in a particular high-noise environment and fail consistently in an area of virtual radio silence, or the reverse.

Even when invitations fail, other presence information continues to flow correctly. Even activity sharing continues to work beautifully.

With some help from Daf, we managed to get a tcpdump trace from two XOs exhibiting this behavior at 1CC. The dumps are attached to ticket #6463. What we saw is bizarre, but also consistent with the behavior in the UI.

The invitations are unicast, implemented using TCP. When machine A sends an invitation to B, we see the following exchange:

1. A broadcasts an ARP request for B.
2. B sees the ARP request and replies to A.
3. A receives the ARP reply from B and sends a TCP SYN to B.
4. B does not see the SYN packet (it does not appear in B's dump).
5. A retries a total of three times, but none of the SYN packets are seen by B.

3b. In parallel, A broadcasts a presence-info update with mDNS, indicating that it has shared the activity.
4b. B receives this broadcast, updates its presence-info cache, and even assigns A's XO icon a new location in the mesh view.

This behavior is fairly frightening. I have seen it occur in low-noise network environments with a total of 3 XOs, so I suspect a serious bug somewhere in the lowest levels of the network stack.

Once this failure occurs, it is extremely reproducible. All subsequent invitations will continue to fail. I therefore suspect that the bug involves the driver or firmware reaching an invalid state and becoming stuck there.
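For anyone who wants to repeat this kind of comparison on their own captures, the check between the two dumps can be sketched roughly as follows. This is only a minimal Python sketch under stated assumptions: the Packet record, its seq field, and the sample data are all hypothetical, and real tcpdump output would first have to be converted into some similar form. The sample uses three SYNs purely as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str   # sender, e.g. "A"
    dst: str   # receiver, or "broadcast"
    kind: str  # e.g. "ARP-request", "TCP-SYN", "mDNS"
    seq: int   # hypothetical per-sender counter used to match packets across dumps


def missing_at_receiver(sent, received):
    """Return packets from the sender's dump that never appear in the receiver's dump."""
    seen = set(received)
    return [p for p in sent if p not in seen]


def summarize(missing):
    """Count the lost packets, split into broadcast and unicast."""
    counts = {"broadcast": 0, "unicast": 0}
    for p in missing:
        key = "broadcast" if p.dst == "broadcast" else "unicast"
        counts[key] += 1
    return counts


# Hypothetical data shaped like the exchange described above; NOT the real dumps.
dump_a_sent = [
    Packet("A", "broadcast", "ARP-request", 1),
    Packet("A", "B", "TCP-SYN", 2),
    Packet("A", "B", "TCP-SYN", 3),
    Packet("A", "B", "TCP-SYN", 4),
    Packet("A", "broadcast", "mDNS", 5),
]
dump_b_received = [
    Packet("A", "broadcast", "ARP-request", 1),
    Packet("A", "broadcast", "mDNS", 5),
]

lost = missing_at_receiver(dump_a_sent, dump_b_received)
print(summarize(lost))
```

If the losses on a failing pair of machines turn out to be all unicast and no broadcast, that matches the pattern described above: broadcast traffic from A arrives at B, while unicast TCP from A does not.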
Given the variety of critical services that run over TCP, including the much-emphasized Read activity, I hope that people familiar with the driver and firmware will take a look at this bug.

--Ben Schwartz

P.S. All this info is present at ticket #6463. I am writing about it here in an attempt to increase awareness.

_______________________________________________
Devel mailing list
[email protected]
http://lists.laptop.org/listinfo/devel
