|
http://lwn.net/Articles/304791/

A seemingly innocuous change to the networking code that went into the 2.6.27 kernel is now causing trouble for various distributions. Ubuntu, Fedora, and openSUSE are all buttoning up their packages for a release in the near future (with Ubuntu's due this week), so kernel changes are not particularly welcome. Unfortunately, if the problem is not addressed, some users may never be able to download a fix because their TCP/IP stack won't interoperate with some broken equipment on the internet.

The problem stems from a cleanup of the TCP option code that was merged back in July as part of the 2.6.27 merge window. TCP options are a mechanism to expand the functionality of the protocol as conditions change. There are a handful of commonly used options that the two endpoints of a connection can agree to use, for things like maximum segment size (MSS), window scaling, selective acknowledgment (SACK), and timestamps. Options have been added over time to provide more internet robustness and performance as well as to support higher-bandwidth physical connections.

A perfectly reasonable, if unintended, consequence of the code change was that the options were put into the header in a slightly different order. According to the relevant RFCs, options can appear in any order in the option section of the TCP header. But some home and/or internet routers seem to expect a fixed order, refusing to make connections if the order is "wrong". In particular, it would seem that the MSS option needs to appear before the SACK option.

The bug was reported to Ubuntu Launchpad in early September, but not a lot of progress was made until it was added to the kernel.org bugzilla in early October. It seems to have only affected a relatively small number of users (Red Hat's Dave Jones said that there were no reports from users of the rawhide 2.6.27 kernel), as it was rather hardware-specific.
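
To make the ordering point concrete, here is a minimal Python sketch (not the kernel's code) that encodes the same four options in two different orders and parses them back. The option kind numbers (NOP 1, MSS 2, window scale 3, SACK-permitted 4, timestamps 8) are the ones assigned in the TCP RFCs. The helper names `build_options` and `parse_options`, the sample values, and the trailing NOP padding are assumptions made for illustration, and the two orders shown are not claimed to be the exact layouts produced before and after the 2.6.27 change.

```python
import struct

# Option kinds as assigned in the TCP RFCs.
EOL, NOP, MSS, WSCALE, SACK_PERM, TIMESTAMP = 0, 1, 2, 3, 4, 8


def build_options(options):
    """Encode (kind, payload_bytes) pairs in the given order.

    Simplification: the block is padded to a 4-byte boundary with
    trailing NOPs; real stacks place their padding differently.
    """
    out = bytearray()
    for kind, payload in options:
        # Each option is: kind (1 byte), total length (1 byte), payload.
        out += bytes([kind, 2 + len(payload)]) + payload
    while len(out) % 4:
        out.append(NOP)
    return bytes(out)


def parse_options(data):
    """Return ({kind: payload}, [kinds in order of appearance])."""
    found, order, i = {}, [], 0
    while i < len(data):
        kind = data[i]
        if kind == EOL:          # end of option list
            break
        if kind == NOP:          # single-byte padding, no length field
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated option")
        length = data[i + 1]
        if length < 2 or i + length > len(data):
            raise ValueError("malformed option length")
        found[kind] = data[i + 2:i + length]
        order.append(kind)
        i += length
    return found, order


# Hypothetical sample values for a SYN segment.
mss = (MSS, struct.pack("!H", 1460))
sack = (SACK_PERM, b"")
ts = (TIMESTAMP, struct.pack("!II", 12345, 0))
ws = (WSCALE, struct.pack("!B", 7))

mss_first = build_options([mss, sack, ts, ws])
sack_first = build_options([sack, ts, mss, ws])

opts_a, order_a = parse_options(mss_first)
opts_b, order_b = parse_options(sack_first)

# Same options and values either way; only the byte layout differs.
print(opts_a == opts_b, order_a, order_b)
```

A parser that walks the kind/length fields like this recovers the same set of options from both byte strings. Equipment that accepts only one of them is presumably depending on where each option sits rather than on what it says, which is the kind of behavior the article describes.
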
That hardware dependence made the bug difficult to track down for the majority of folks who couldn't reproduce it. Ubuntu user Aldo Maggi, who filed the kernel bug, sets a marvelous example of how to work with the kernel hackers to track down the problem, as can be seen in the bugzilla entry. Eventually, the option re-ordering problem was discovered and a patch was submitted by Ilpo Järvinen that restored the order of the options. Along the way, with help from Mandriva, it was discovered that turning off TCP timestamps by way of:

    sysctl -w net.ipv4.tcp_timestamps=0

worked around the problem without changing the kernel, at the cost of losing the TCP timestamp functionality.

So it would seem that the problem has been solved (the patch has been merged into Linus Torvalds's tree for 2.6.28), but there are still a few unresolved issues. The three distributions that are preparing new releases are all based on 2.6.27, but as yet, there has not been a -stable kernel release that picks up the patch, though it is likely to come fairly soon.

In the meantime, Fedora has added the patch to its kernel in rawhide, so Fedora 10 (and eventually Fedora 9 when it gets rebased on 2.6.27) will have the fix. openSUSE is waiting a bit to see what gets submitted by the kernel networking developers to the -stable team. As Novell/SUSE kernel hacker Greg Kroah-Hartman puts it: "We still have a while to go before the final 11.1 kernel is released, so we feel no pressure here."

Unfortunately, Ubuntu got caught very late in its release cycle, as 8.10 (or Intrepid Ibex) is due on October 30. The original plan, as outlined by Debian/Ubuntu hacker Steve Langasek, was to note the problem in the release notes for 8.10, but not address the underlying problem until after the release:

    The kernel fix is known upstream; implementing it requires kernel uploads and installer rebuilds, which it's just not possible to fit in between the release candidate and the release. We will certainly want to include this fix in a kernel update as soon as possible after the release, but this is unfortunately in a class of bugs that we can't fix the week of release (even turning timestamps off requires a kernel upload, unless we want to permanently disable tcp timestamp support for Ubuntu 8.10).

That led many in the Launchpad bug thread to note that it was going to be a real mess, especially for the least technical of users. Nick Lowe sums up the problem:

    [...] You should really delay for this if you need more time...

    RC shouldn't mean Release ComeHellOrHighWater

    The users who are most likely to hit this are home users behind their aged/unmaintained consumer routers who are highly unlikely to understand why they can't access the Web and will just go elsewhere...

Certainly, the release notes are not the first place an affected user would go if they ran into the problem. More than likely, they would just decide that Ubuntu (and, by extension, Linux) is simply broken, so it is a relief to see that Ubuntu eventually relented. For 8.10, the procps package has been changed to work around the problem by turning off timestamps. Once a new kernel package is released with the re-ordering patch included, timestamps can presumably be restored.

This kind of problem, where affected users may not be able to retrieve an update to fix it, should really be part of the definition of a show-stopping (i.e. release date slipping) problem. It was rather galling to some that Ubuntu would consider shipping with this known issue, simply to make its 8.10 release in the 10th month of 2008 (which is how Ubuntu releases are numbered). Ubuntu is justifiably proud of its record of shipping releases on time, but it cannot do that at the expense of its users.

While the workaround that was implemented was perhaps suboptimal, it does ensure that users (especially non-technical users) won't find that web surfing doesn't work in Linux. It should also allow Ubuntu to release on schedule.

[ Thanks to Nick Lowe for giving us a heads-up about this issue. ] |
