Re: Failed sending data to the peer in 7.72.0

2020-09-16 Thread James Brown via curl-library
I've been running 7.72.0 with that patch cherry-picked for over a day and
the issue has not recurred. There's definitely something spooky going on.
Does extract_if_dead have any other side effects?

On Wed, Sep 16, 2020 at 1:22 AM Daniel Stenberg  wrote:

> On Mon, 14 Sep 2020, James Brown via curl-library wrote:
>
> Thanks for digging deep into this issue!
>
> > I went through commits on master that aren't in 7.72.0 and came across
> > 728f8d3bd
> > <
> https://github.com/curl/curl/commit/728f8d3bdc336e3fa838f45cad6c0133a6b604ae>,
>
> > which looks promising. It says it partially reverts a change from
> 7.65.2,
> > which is kind of weird because this problem does *not *occur for me in
> > 7.71.1, which suggests that one of the other changes in 7.72.0 (maybe
> the
> > schannel stuff?) somehow had some spooky-action-at-a-distance and made
> this
> > much worse in 7.72.0.
>
> Well, the problem this commit fixes is the opposite: libcurl would
> previously
> reuse connections that actually are already dead and thus detecting the
> "deadness" gets delayed.
>
> > I feel like 728f8d3bd still has a race condition if the FIN has been
> issued
> > by the server but not yet processed by poll, but that's neither here nor
> > there.
>
> It's a race, sure, but that shouldn't be a problem. If the connection is
> dead
> at the time extract_if_dead() is called, it should detect that and not
> reuse
> the connection.
>
> If it dies a millisecond after the check, the connection will instead be
> reused and the sending of the outgoing request will subsequently fail.
> libcurl
> will then attempt to create a new connection and retry the request over
> that.
>
> If indeed calling extract_if_dead() there causes the problem for you, then
> I'd
> like to get more information about what exactly the problem is within that
> function so that we can address that rather than just reverting the commit
> again and just go full yoyo on these issues!
>
> --
>
>   / daniel.haxx.se | Commercial curl support up to 24x7 is available!
>| Private help, bug fixes, support, ports, new features
>| https://www.wolfssl.com/contact/
>


-- 
James Brown
Engineer
---
Unsubscribe: https://cool.haxx.se/list/listinfo/curl-library
Etiquette:   https://curl.haxx.se/mail/etiquette.html

Re: Failed sending data to the peer in 7.72.0

2020-09-16 Thread Daniel Stenberg via curl-library

On Mon, 14 Sep 2020, James Brown via curl-library wrote:

Thanks for digging deep into this issue!

I went through commits on master that aren't in 7.72.0 and came across 
728f8d3bd 
, 
which looks promising. It says it partially reverts a change from 7.65.2, 
which is kind of weird because this problem does *not *occur for me in 
7.71.1, which suggests that one of the other changes in 7.72.0 (maybe the 
schannel stuff?) somehow had some spooky-action-at-a-distance and made this 
much worse in 7.72.0.


Well, the problem this commit fixes is the opposite: libcurl would previously 
reuse connections that actually are already dead and thus detecting the 
"deadness" gets delayed.


I feel like 728f8d3bd still has a race condition if the FIN has been issued 
by the server but not yet processed by poll, but that's neither here nor 
there.


It's a race, sure, but that shouldn't be a problem. If the connection is dead 
at the time extract_if_dead() is called, it should detect that and not reuse 
the connection.


If it dies a millisecond after the check, the connection will instead be 
reused and the sending of the outgoing request will subsequently fail. libcurl 
will then attempt to create a new connection and retry the request over that.


If indeed calling extract_if_dead() there causes the problem for you, then I'd 
like to get more information about what exactly the problem is within that 
function so that we can address that rather than just reverting the commit 
again and just go full yoyo on these issues!


--

 / daniel.haxx.se | Commercial curl support up to 24x7 is available!
  | Private help, bug fixes, support, ports, new features
  | https://www.wolfssl.com/contact/
---
Unsubscribe: https://cool.haxx.se/list/listinfo/curl-library
Etiquette:   https://curl.haxx.se/mail/etiquette.html

Re: Failed sending data to the peer in 7.72.0

2020-09-14 Thread James Brown via curl-library
I went through commits on master that aren't in 7.72.0 and came across
728f8d3bd
,
which looks promising. It says it partially reverts a change from 7.65.2,
which is kind of weird because this problem does *not *occur for me in
7.71.1, which suggests that one of the other changes in 7.72.0 (maybe the
schannel stuff?) somehow had some spooky-action-at-a-distance and made this
much worse in 7.72.0. I feel like 728f8d3bd still has a race condition if
the FIN has been issued by the server but not yet processed by poll, but
that's neither here nor there.

Anyhow, I cherry-picked 728f8d3bd and it does seem to have fixed my issue
(or, at least, my issue hasn't occurred in the intervening ~30 minutes).

On Mon, Sep 14, 2020 at 2:38 PM James Brown  wrote:

> Unfortunately, it takes rather several hours of running to identify this
> failure; building a new package and bisecting is going to be quite
> difficult with that long of a turnaround.
>
> I however was able to get a tcpdump of one of the affected sessions.
>
> This session spans several requests to the same backend (Typhoeus pools
> curl sockets and tries to use keep-alive whenever possible). The first few
> requests succeed; at some point there's a ~9-second break in requests while
> the process does other stuff and the server closes the TCP stream (our load
> balancer is configured to close idle keep-alive sessions after 9 seconds).
> Within a few milliseconds, another request is sent on the same socket; it
> immediately gets back a RST because the server has already closed this
> socket. That appears to cause this exception to be raised by libcurl.
>
> Did anything change around handling keep-alive session expiry in HTTP/1.1
> mode? Nothing jumps out at me in the git log (there's something in the
> schannel backend, but I'm on Linux; there's also d5bb459ccf
> 
>  which
> claims to only affect CONNECT-only connections, and all of this is regular
> GETs and POSTs)...
>
> On Fri, Sep 11, 2020 at 10:34 PM Ray Satiro via curl-library <
> curl-library@cool.haxx.se> wrote:
>
>> On 9/11/2020 2:03 PM, James Brown via curl-library wrote:
>>
>> After upgrading a test cluster from 7.71.1 to 7.72.0, we're now seeing
>> around 0.1% of POSTs from one (and only one) of our applications fail with
>> "Failed sending data to the peer" (CURLE_SEND_ERROR) and no other error.
>> Based on logs, the request actually succeeds, but libcurl is returning this
>> error. This application is using the Ruby Typhoeus wrapper and is itself
>> unchanged. The relevant connections are all HTTP/1.1 connections to hosts
>> on the local network, and the POSTs are all very small (<1KB) with nothing
>> interesting about them.
>>
>> I haven't had any luck tracking this down since it's such a low fraction
>> of requests and is only affecting one of our several hundred applications,
>> but it reproducibly happens with 7.72 and not with 7.71.1.
>>
>> Anyone have any suggestions for how to try to track down the regression?
>> I looked at the diff between 7.71.1 and 7.72.0 and no lines containing the
>> string "CURLE_SEND_ERROR" were touched, which is unfortunate.
>>
>>
>> There are no similar reports and I looked through the commit history but
>> nothing stood out. If you can reliably reproduce then try bisecting it
>> https://github.com/curl/curl/wiki/how-to-git-bisect
>>
>>
>> ---
>> Unsubscribe: https://cool.haxx.se/list/listinfo/curl-library
>> Etiquette:   https://curl.haxx.se/mail/etiquette.html
>
>
>
> --
> James Brown
> Engineer
>


-- 
James Brown
Engineer
---
Unsubscribe: https://cool.haxx.se/list/listinfo/curl-library
Etiquette:   https://curl.haxx.se/mail/etiquette.html

Re: Failed sending data to the peer in 7.72.0

2020-09-14 Thread James Brown via curl-library
Unfortunately, it takes rather several hours of running to identify this
failure; building a new package and bisecting is going to be quite
difficult with that long of a turnaround.

I however was able to get a tcpdump of one of the affected sessions.

This session spans several requests to the same backend (Typhoeus pools
curl sockets and tries to use keep-alive whenever possible). The first few
requests succeed; at some point there's a ~9-second break in requests while
the process does other stuff and the server closes the TCP stream (our load
balancer is configured to close idle keep-alive sessions after 9 seconds).
Within a few milliseconds, another request is sent on the same socket; it
immediately gets back a RST because the server has already closed this
socket. That appears to cause this exception to be raised by libcurl.

Did anything change around handling keep-alive session expiry in HTTP/1.1
mode? Nothing jumps out at me in the git log (there's something in the
schannel backend, but I'm on Linux; there's also d5bb459ccf

which
claims to only affect CONNECT-only connections, and all of this is regular
GETs and POSTs)...

On Fri, Sep 11, 2020 at 10:34 PM Ray Satiro via curl-library <
curl-library@cool.haxx.se> wrote:

> On 9/11/2020 2:03 PM, James Brown via curl-library wrote:
>
> After upgrading a test cluster from 7.71.1 to 7.72.0, we're now seeing
> around 0.1% of POSTs from one (and only one) of our applications fail with
> "Failed sending data to the peer" (CURLE_SEND_ERROR) and no other error.
> Based on logs, the request actually succeeds, but libcurl is returning this
> error. This application is using the Ruby Typhoeus wrapper and is itself
> unchanged. The relevant connections are all HTTP/1.1 connections to hosts
> on the local network, and the POSTs are all very small (<1KB) with nothing
> interesting about them.
>
> I haven't had any luck tracking this down since it's such a low fraction
> of requests and is only affecting one of our several hundred applications,
> but it reproducibly happens with 7.72 and not with 7.71.1.
>
> Anyone have any suggestions for how to try to track down the regression? I
> looked at the diff between 7.71.1 and 7.72.0 and no lines containing the
> string "CURLE_SEND_ERROR" were touched, which is unfortunate.
>
>
> There are no similar reports and I looked through the commit history but
> nothing stood out. If you can reliably reproduce then try bisecting it
> https://github.com/curl/curl/wiki/how-to-git-bisect
>
>
> ---
> Unsubscribe: https://cool.haxx.se/list/listinfo/curl-library
> Etiquette:   https://curl.haxx.se/mail/etiquette.html



-- 
James Brown
Engineer
---
Unsubscribe: https://cool.haxx.se/list/listinfo/curl-library
Etiquette:   https://curl.haxx.se/mail/etiquette.html

Re: Failed sending data to the peer in 7.72.0

2020-09-11 Thread Ray Satiro via curl-library

On 9/11/2020 2:03 PM, James Brown via curl-library wrote:
After upgrading a test cluster from 7.71.1 to 7.72.0, we're now seeing 
around 0.1% of POSTs from one (and only one) of our applications fail 
with "Failed sending data to the peer" (CURLE_SEND_ERROR) and no other 
error. Based on logs, the request actually succeeds, but libcurl is 
returning this error. This application is using the Ruby Typhoeus 
wrapper and is itself unchanged. The relevant connections are all 
HTTP/1.1 connections to hosts on the local network, and the POSTs are 
all very small (<1KB) with nothing interesting about them.


I haven't had any luck tracking this down since it's such a low 
fraction of requests and is only affecting one of our several hundred 
applications, but it reproducibly happens with 7.72 and not with 7.71.1.


Anyone have any suggestions for how to try to track down the 
regression? I looked at the diff between 7.71.1 and 7.72.0 and no 
lines containing the string "CURLE_SEND_ERROR" were touched, which is 
unfortunate.



There are no similar reports and I looked through the commit history but 
nothing stood out. If you can reliably reproduce then try bisecting it 
https://github.com/curl/curl/wiki/how-to-git-bisect



---
Unsubscribe: https://cool.haxx.se/list/listinfo/curl-library
Etiquette:   https://curl.haxx.se/mail/etiquette.html

Failed sending data to the peer in 7.72.0

2020-09-11 Thread James Brown via curl-library
After upgrading a test cluster from 7.71.1 to 7.72.0, we're now seeing
around 0.1% of POSTs from one (and only one) of our applications fail with
"Failed sending data to the peer" (CURLE_SEND_ERROR) and no other error.
Based on logs, the request actually succeeds, but libcurl is returning this
error. This application is using the Ruby Typhoeus wrapper and is itself
unchanged. The relevant connections are all HTTP/1.1 connections to hosts
on the local network, and the POSTs are all very small (<1KB) with nothing
interesting about them.

I haven't had any luck tracking this down since it's such a low fraction of
requests and is only affecting one of our several hundred applications, but
it reproducibly happens with 7.72 and not with 7.71.1.

Anyone have any suggestions for how to try to track down the regression? I
looked at the diff between 7.71.1 and 7.72.0 and no lines containing the
string "CURLE_SEND_ERROR" were touched, which is unfortunate.

-- 
James Brown
Engineer
---
Unsubscribe: https://cool.haxx.se/list/listinfo/curl-library
Etiquette:   https://curl.haxx.se/mail/etiquette.html