> On Mar 27, 2017, at 08:13, Yuya Nishihara <y...@tcha.org> wrote: > >> On Sun, 26 Mar 2017 11:56:59 -0700, Gregory Szorc wrote: >> # HG changeset patch >> # User Gregory Szorc <gregory.sz...@gmail.com> >> # Date 1490554582 25200 >> # Sun Mar 26 11:56:22 2017 -0700 >> # Branch stable >> # Node ID 414128ddd876eb33f70b0dd95d110e29a9308c93 >> # Parent ed5b25874d998ababb181a939dd37a16ea644435 >> changegroup: retry read() when partial data returned >> >> We've seen a rash of "stream ended unexpectedly" errors in the >> wild. This occurs when changegroup.readexactly() fails to retrieve >> the exact number of requested bytes in a single stream.read() >> operation. >> >> There are several plausible theories that explain this behavior. >> Many include underlying APIs not retrying automatically when >> interrupted by a signal. While Python is supposed to do this, >> it could be buggy. There could also be a bug in the various >> stream reading layers between changegroup.readexactly() and the >> underlying file descriptor. >> >> Unfortunately, getting this error to reproduce has been extremely >> difficult and no single cause has been identified. >> >> Since networks (and even filesystems) are unreliable, it seems >> wise to have some form of retry in place to mitigate this failure. >> This patch adds that retry logic. >> >> There is already basic test coverage that the exception in this >> function is raised. The tests didn't change with this patch. >> >> The only obvious negative impact to this change I can see is that >> in cases where the read operation fails, there is some overhead >> involved with retrying the operation. In the worst case, this is >> pure overhead: we'd retry an operation and get the same outcome. >> But in the best case, it avoids an intermittent/random failure. >> My assumption is that the overhead to retry is small and that >> the cost to avoiding a random failure is justified. >> >> There are other changes I'd like to make to stream reading code >> to help flush out this general failure of partial stream reads. >> See issue4923 for some suggestions with regards to swallowing >> exception and masking the underlying error. However, these >> changes aren't suitable for the stable branch. This patch feels >> minimally invasive and if successful would let us narrow our focus >> for finding the underlying bug. >> >> diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py >> --- a/mercurial/changegroup.py >> +++ b/mercurial/changegroup.py >> @@ -35,12 +35,42 @@ from . import ( >> >> def readexactly(stream, n): >> '''read n bytes from stream.read and abort if less was available''' >> - s = stream.read(n) >> - if len(s) < n: >> + # Most of the time, stream.read() returns exactly the number of bytes >> + # requested. Various Python APIs will perform multiple system calls >> + # to retrieve more bytes if the first call did not return enough. This >> + # includes cases where the system call is interrupted. >> + # >> + # Even with this retry logic in the Python APIs, we still see failure >> + # to retrieve the requested number of bytes. So, we build in our own >> + # retry layer here. >> + left = n >> + chunk = stream.read(n) >> + left -= len(chunk) >> + assert left >= 0 >> + >> + # Common case is it just works. >> + if not left: >> + return chunk >> + >> + chunks = [chunk] >> + # Retry to get remaining data. In cases where the stream has errored or >> + # is at EOF, this will do a bit of redundant work. But it helps prevent >> + # intermittent failures and isn't common. So the overhead is acceptable. >> + for i in range(3): >> + chunk = stream.read(left) >> + chunks.append(chunk) >> + left -= len(chunk) >> + assert left >= 0 >> + >> + if not left: >> + break > > Do you have some number that supports this change actually mitigates the > problem?
No. I could try to get that this week. > > Suppose underlying functions have bugs, I don't think we can always get > contiguous chunks with no data loss. If a few bytes lost for example, we > could read arbitrary part as a length field, and call readexactly() with > that invalid length value. Hmm. My assumption is that any error that would result in data loss effectively poisons the stream and prevents future reads. _______________________________________________ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel