# HG changeset patch # User Gregory Szorc <gregory.sz...@gmail.com> # Date 1490554582 25200 # Sun Mar 26 11:56:22 2017 -0700 # Branch stable # Node ID 414128ddd876eb33f70b0dd95d110e29a9308c93 # Parent ed5b25874d998ababb181a939dd37a16ea644435 changegroup: retry read() when partial data returned
We've seen a rash of "stream ended unexpectedly" errors in the wild. This occurs when changegroup.readexactly() fails to retrieve the exact number of requested bytes in a single stream.read() operation. There are several plausible theories that explain this behavior. Many include underlying APIs not retrying automatically when interrupted by a signal. While Python is supposed to do this, it could be buggy. There could also be a bug in the various stream reading layers between changegroup.readexactly() and the underlying file descriptor. Unfortunately, getting this error to reproduce has been extremely difficult and no single cause has been identified. Since networks (and even filesystems) are unreliable, it seems wise to have some form of retry in place to mitigate this failure. This patch adds that retry logic. There is already basic test coverage that the exception in this function is raised. The tests didn't change with this patch. The only obvious negative impact to this change I can see is that in cases where the read operation fails, there is some overhead involved with retrying the operation. In the worst case, this is pure overhead: we'd retry an operation and get the same outcome. But in the best case, it avoids an intermittent/random failure. My assumption is that the overhead to retry is small and that the cost to avoiding a random failure is justified. There are other changes I'd like to make to stream reading code to help flush out this general failure of partial stream reads. See issue4923 for some suggestions with regards to swallowing exception and masking the underlying error. However, these changes aren't suitable for the stable branch. This patch feels minimally invasive and if successful would let us narrow our focus for finding the underlying bug. diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py --- a/mercurial/changegroup.py +++ b/mercurial/changegroup.py @@ -35,12 +35,42 @@ from . import ( def readexactly(stream, n): '''read n bytes from stream.read and abort if less was available''' - s = stream.read(n) - if len(s) < n: + # Most of the time, stream.read() returns exactly the number of bytes + # requested. Various Python APIs will perform multiple system calls + # to retrieve more bytes if the first call did not return enough. This + # includes cases where the system call is interrupted. + # + # Even with this retry logic in the Python APIs, we still see failure + # to retrieve the requested number of bytes. So, we build in our own + # retry layer here. + left = n + chunk = stream.read(n) + left -= len(chunk) + assert left >= 0 + + # Common case is it just works. + if not left: + return chunk + + chunks = [chunk] + # Retry to get remaining data. In cases where the stream has errored or + # is at EOF, this will do a bit of redundant work. But it helps prevent + # intermittent failures and isn't common. So the overhead is acceptable. + for i in range(3): + chunk = stream.read(left) + chunks.append(chunk) + left -= len(chunk) + assert left >= 0 + + if not left: + break + + if left: raise error.Abort(_("stream ended unexpectedly" " (got %d bytes, expected %d)") - % (len(s), n)) - return s + % (n - left, n)) + + return b''.join(chunks) def getchunk(stream): """return the next chunk from stream as a string""" _______________________________________________ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel