[ 
https://issues.apache.org/jira/browse/TS-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudheer Vinukonda updated TS-3226:
----------------------------------
    Description: 
For a long time, some of our origins have complained of receiving POST 
requests with a non-zero Content-Length header but no body (or sometimes only 
a partial body). Because of the way our network is set up, the problem was 
hard to isolate: there are multiple hops along the way, and the POST body 
could be lost anywhere along the path (e.g. client, DNS, routers/VIPs, edge, 
data center). After a lot of debugging, and with the help of some custom-built 
wire traces for SSL, we isolated the problem to the ATS hosts running on our 
edge layer. The wire traces showed that the POST body arrives intact, but just 
sits in the socket and is never read by the POST UA tunnel producer.

After further investigation, it seems that the producer issues the correct 
do_io_read for the required number of bytes, but there appears to be a bug in 
{{SSLNetVConnection::net_read_io}}, where ntodo is calculated before the mutex 
on the read VIO is acquired.

https://github.com/apache/trafficserver/blob/master/iocore/net/SSLNetVConnection.cc#L391

Instrumenting the code with further debug traces showed that, in the failed 
transactions, ntodo is "0" when computed before acquiring the mutex, whereas 
(s->vio.nbytes - s->vio.ndone) is non-zero after acquiring it. I do not yet 
understand how nbytes on the read VIO object can differ before the mutex is 
acquired, but moving the ntodo calculation after the mutex acquisition seems 
to have resolved the problem. Note that this is how it is done in the 
corresponding function {{read_from_net}} in {{UnixNetVConnection}}.
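
To illustrate why the ordering matters, here is a minimal, single-threaded 
sketch of one possible interleaving. It is not the ATS code: {{ReadVio}}, 
{{rearm_read}}, {{ntodo_before_lock}} and {{ntodo_after_lock}} are hypothetical 
names, and the assumption that another thread updates nbytes between the 
unlocked read and the lock acquisition is only a guess at the cause, not 
something the traces confirm. The {{interleaved}} callback stands in for that 
other thread so the example stays deterministic.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>

// Hypothetical stand-in for a read VIO; this is not the ATS VIO class.
struct ReadVio {
  std::mutex mutex;
  int64_t nbytes = 0; // total bytes requested by the consumer
  int64_t ndone  = 0; // bytes already delivered
};

// Hypothetical helper: what a producer re-arming the read might do,
// assuming it updates the VIO while holding the VIO's mutex.
void rearm_read(ReadVio &vio, int64_t nbytes) {
  std::lock_guard<std::mutex> lock(vio.mutex);
  vio.nbytes = nbytes;
  vio.ndone  = 0;
}

// Problematic ordering: ntodo is snapshotted before the lock is taken.
// `interleaved` simulates another thread running in that window.
int64_t ntodo_before_lock(ReadVio &vio, const std::function<void()> &interleaved) {
  int64_t ntodo = vio.nbytes - vio.ndone; // read without holding the lock
  interleaved();                          // the VIO may change here
  std::lock_guard<std::mutex> lock(vio.mutex);
  assert(vio.ndone <= vio.nbytes);
  return ntodo; // possibly stale by now
}

// Ordering described as the fix: take the lock first, then compute ntodo.
int64_t ntodo_after_lock(ReadVio &vio, const std::function<void()> &interleaved) {
  interleaved(); // any earlier update is already visible under the lock
  std::lock_guard<std::mutex> lock(vio.mutex);
  assert(vio.ndone <= vio.nbytes);
  return vio.nbytes - vio.ndone;
}
```

With a fresh {{ReadVio}} and a callback that calls {{rearm_read(vio, 1024)}}, 
the first function reports nothing left to read, while the second reports the 
full 1024 bytes. In this sketch, a reader that returns early on a zero ntodo 
would leave the data sitting in the socket, which matches the symptom above.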


  was:
We have had a problem where some of our origins were complaining of receiving 
POST requests with non-zero content-length header, but, no body (or sometimes, 
partial body). Due to the way our network was setup, this problem was not easy 
to be isolated due to the various multiple hops along the way. The post body 
could be lost anywhere along the path (e.g. client, dns, routers/vips, edge, 
data center etc). After a lot of debugging and with the help of some 
custom-built wire traces for SSL, we managed to isolate the problem to our ATS 
hosts running on our edge layer. From the wire traces, we could see that, the 
post body is coming in alright, but is just sitting in the socket and not being 
read by the post ua tunnel producer.

After further investigation, it seems that the producer is issuing the correct 
do_io_read for the required number of bytes, but, there seems to be a bug in 
the {{SSLNetVConnection::net_read_io}}, where the ntodo is being calculated 
before acquiring the mutex on the read vio.

https://github.com/apache/trafficserver/blob/master/iocore/net/SSLNetVConnection.cc#L391

Instrumenting the code with further debug traces showed that, in the failed 
transactions, I am noticing the ntodo being "0" when determined before the 
mutex, whereas the (s->vio.nbytes - s->vio.ndone) is non-zero after the mutex. 
I am not sure to understand how the nbytes on the read vio object can be 
different before acquiring mutex, but, moving the ntodo calculation after mutex 
seems to have resolved the problem. Note that this is how it is done in the 
corresponding function {{read_from_net}} in {{UnixNetVConnection}}.



> SSL data not read from the socket sometimes causing transactions to timeout
> ---------------------------------------------------------------------------
>
>                 Key: TS-3226
>                 URL: https://issues.apache.org/jira/browse/TS-3226
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: SSL
>            Reporter: Sudheer Vinukonda



