On Sat, Mar 25, 2017 at 10:34 AM, Gregory Szorc <gregory.sz...@gmail.com> wrote:
> On Sat, Mar 25, 2017 at 4:19 AM, Mike Hommey <m...@glandium.org> wrote: > >> Hi, >> >> I don't know about you, but occasionally, I've hit "stream ended >> unexpectedly (got m bytes, expected n)" errors that didn't make sense. >> Retrying would always work. >> >> Recently, I was trying to use signal.setitimer and a signal handler for >> some memory profiling on git-cinnabar, which uses mercurial as a >> library, and got "stream ended 4 unexpectedly (got m bytes, expected n)" >> *very* reproducibly. Like, with an interval timer firing every second, >> it took only a few seconds to hit the error during a clone. >> >> I'm pretty sure this can be reproduced with a similar setup in mercurial >> itself. >> >> Now, the reason this happens in this case is that, the code that fails >> does: >> >> def readexactly(stream, n): >> '''read n bytes from stream.read and abort if less was available''' >> s = stream.read(n) >> if len(s) < n: >> raise error.Abort(_("stream ended unexpectedly" >> " (got %d bytes, expected %d)") >> % (len(s), n)) >> return s >> >> ... and thanks to POSIX, interrupted reads can lead to short reads. So, >> you request n bytes, and get less, just because something interrupted >> the process. The problem then is that python doesn't let you know why >> you just got a short read, and you have to figure that out on your own. >> >> The same kind of problem is also true to some extent on writes. >> >> Now, the problem is that this piece of code is likely the most visible >> place where the issue exists, but there are many other places in the >> mercurial code base that are likely affected. >> >> And while the signal.setitimer case is a corner case (and I ended up >> using a separate thread to work around the problem ; my code wasn't >> interruption safe either anyways), I wonder if those random "stream >> ended unexpectedly (got m bytes, expected n)" errors I was getting under >> normal circumstances are not just a manifestation of the same underlying >> issue, which is that the code doesn't like interrupted reads. >> >> Disclaimer: I'm not going to work on fixing this issue, but I figured >> I'd let you know, in case someone wants to look into it more deeply. >> > > Thank you for writing this up. This "stream ended unexpectedly" has been a > thorn in my side for a while, as it comes up frequently in Mozilla's CI > with a frequency somewhere between 1 in 100-1000. Even retrying failed > operations multiple times isn't enough to overcome it > > I have long suspected interrupted system calls as a likely culprit. > However, when I initially investigated this a few months ago, I found that > Python's I/O APIs retry automatically for EINTR. See > https://hg.python.org/cpython/file/54c93e0fe79b/Lib/socket.py#l365 for > example. This /should/ make e.g. socket._fileobject.read() resilient > against signal interruption. (If Python's I/O APIs didn't do this, tons of > programs would break. Also, the semantics of .read() are such that it is > always supposed to retrieve all available bytes until EOF - at least for > some io ABCs. read1() exists to perform at most 1 system call.) > > But you provide compelling evidence that signal interruptions do in fact > result in truncated reads. So something is obviously going wrong. > > The stream.read() in changegroup.readexactly() actually follows a pretty > complicated path. Here is what happens when reading from an HTTP response: > > 1. stream is a mercurial.httppeer.readerproxy > 2. The readerproxy is constructed from util._zlibengine.decompressorreader > and a mercurial.keepalive.HTTPResponse instance > 3. decompressorreader takes the file object interface from the > HTTPResponse, constructs an util.filechunkiter, feeds that into a > zlib.decompressobj and turns that into a util.chunkbuffer > 4. When you stream.read(), it has to traverse through a util.chunkbuffer, > a generator of zlib decompressed chunks, a util.filechunkiter, and HTTP > chunked transfer decoding before it finally gets to a recv() in > socket._fileobject.read(). > > Any of those things could have a bug where a signal interrupts things or a > truncated read isn't retried. I suspect our bug is somewhere in this chain. > > I have some other wrinkles to add. > > Mozilla sees these stream failures much more frequently on Windows. They > do happen on Linux and OS X in the wild. But they are >10x more frequent on > Windows. I'm pretty sure we're running 2.7.12 on Windows (on my insistence > to help isolate an old Python as the source of the bug). > > Mozilla also sees "unexpected negative part header" errors. e.g. > https://bugzilla.mozilla.org/show_bug.cgi?id=1333348. The code in bundle2 > that is emitting this error also uses changegroup.readexactly. I've long > suspected the cause of this is a bit flip or reading a truncated or > uninitialized 4 byte field. I would not at all be surprised if "stream > ended unexpectedly" and "unexpected negative part header" share the same > root cause. > > Anyway, since you can reproduce this pretty reliably with a constant > barrage of signals, I reckon the next step is for someone to record program > execution with rr and find out how signals are resulting in data > truncation. I was tentatively planning to spend some time next week to look > into this general problem (because it is plaguing Mozilla). So I guess that > person will be me. > > Thanks again for the write-up. It is very helpful. > Mozilla started seeing "stream ended unexpectedly" a lot more frequently after upgrading server repos to generaldelta. Specifically, clients with non-generaldelta clones started seeing the failure more often. I packet traced my way to glory: https://bugzilla.mozilla.org/show_bug.cgi?id=1291926#c78 tl;dr it looks like hg client consumes/buffers server output, becomes idle over the network for a while because client is taking a while to apply data, and the server drops the connection, leading to "stream ended unexpectedly." Making this confusing is that an OSError for the server-side drop is masked due to bundle2 trying to recover and read to EOF. Expect some patches from me for the 4.2 cycle. I'm still trying to reproduce the signal-based failure. So many root causes all leading to same "stream ended unexpectedly." The least we can do in 4.2 is make the error reporting better.
_______________________________________________ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel