I went through commits on master that aren't in 7.72.0 and came across 728f8d3bd <https://github.com/curl/curl/commit/728f8d3bdc336e3fa838f45cad6c0133a6b604ae>, which looks promising. It says it partially reverts a change from 7.65.2, which is kind of weird because this problem does *not *occur for me in 7.71.1, which suggests that one of the other changes in 7.72.0 (maybe the schannel stuff?) somehow had some spooky-action-at-a-distance and made this much worse in 7.72.0. I feel like 728f8d3bd still has a race condition if the FIN has been issued by the server but not yet processed by poll, but that's neither here nor there.
Anyhow, I cherry-picked 728f8d3bd and it does seem to have fixed my issue (or, at least, my issue hasn't occurred in the intervening ~30 minutes). On Mon, Sep 14, 2020 at 2:38 PM James Brown <jbr...@easypost.com> wrote: > Unfortunately, it takes rather several hours of running to identify this > failure; building a new package and bisecting is going to be quite > difficult with that long of a turnaround. > > I however was able to get a tcpdump of one of the affected sessions. > > This session spans several requests to the same backend (Typhoeus pools > curl sockets and tries to use keep-alive whenever possible). The first few > requests succeed; at some point there's a ~9-second break in requests while > the process does other stuff and the server closes the TCP stream (our load > balancer is configured to close idle keep-alive sessions after 9 seconds). > Within a few milliseconds, another request is sent on the same socket; it > immediately gets back a RST because the server has already closed this > socket. That appears to cause this exception to be raised by libcurl. > > Did anything change around handling keep-alive session expiry in HTTP/1.1 > mode? Nothing jumps out at me in the git log (there's something in the > schannel backend, but I'm on Linux; there's also d5bb459ccf > <https://github.com/curl/curl/commit/d5bb459ccf1fc5980ae4b95c05b4ecf6454a7599> > which > claims to only affect CONNECT-only connections, and all of this is regular > GETs and POSTs)... > > On Fri, Sep 11, 2020 at 10:34 PM Ray Satiro via curl-library < > curl-library@cool.haxx.se> wrote: > >> On 9/11/2020 2:03 PM, James Brown via curl-library wrote: >> >> After upgrading a test cluster from 7.71.1 to 7.72.0, we're now seeing >> around 0.1% of POSTs from one (and only one) of our applications fail with >> "Failed sending data to the peer" (CURLE_SEND_ERROR) and no other error. >> Based on logs, the request actually succeeds, but libcurl is returning this >> error. This application is using the Ruby Typhoeus wrapper and is itself >> unchanged. The relevant connections are all HTTP/1.1 connections to hosts >> on the local network, and the POSTs are all very small (<1KB) with nothing >> interesting about them. >> >> I haven't had any luck tracking this down since it's such a low fraction >> of requests and is only affecting one of our several hundred applications, >> but it reproducibly happens with 7.72 and not with 7.71.1. >> >> Anyone have any suggestions for how to try to track down the regression? >> I looked at the diff between 7.71.1 and 7.72.0 and no lines containing the >> string "CURLE_SEND_ERROR" were touched, which is unfortunate. >> >> >> There are no similar reports and I looked through the commit history but >> nothing stood out. If you can reliably reproduce then try bisecting it >> https://github.com/curl/curl/wiki/how-to-git-bisect >> >> >> ------------------------------------------------------------------- >> Unsubscribe: https://cool.haxx.se/list/listinfo/curl-library >> Etiquette: https://curl.haxx.se/mail/etiquette.html > > > > -- > James Brown > Engineer > -- James Brown Engineer
------------------------------------------------------------------- Unsubscribe: https://cool.haxx.se/list/listinfo/curl-library Etiquette: https://curl.haxx.se/mail/etiquette.html