[
https://issues.apache.org/jira/browse/TS-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sudheer Vinukonda resolved TS-3226.
-----------------------------------
Resolution: Fixed
The fix definitely resolves the issue, although I am still unclear on why
it works.
> SSL data not read from the socket sometimes causing transactions to timeout
> ---------------------------------------------------------------------------
>
> Key: TS-3226
> URL: https://issues.apache.org/jira/browse/TS-3226
> Project: Traffic Server
> Issue Type: Bug
> Components: SSL
> Affects Versions: 5.1.1
> Reporter: Sudheer Vinukonda
> Assignee: Sudheer Vinukonda
> Fix For: 5.2.0
>
>
> We have had a long-standing problem where some of our origins complained of
> receiving POST requests with a non-zero Content-Length header but no body
> (or sometimes a partial body). Because of the way our network is set up, the
> problem was hard to isolate: there are multiple hops along the way, and the
> POST body could be lost anywhere along the path (e.g. client, DNS,
> routers/VIPs, edge, data center). After a lot of debugging, and with the
> help of some custom-built wire traces for SSL, we managed to isolate the
> problem to the ATS hosts running on our edge layer. The wire traces showed
> that the POST body arrives intact, but just sits in the socket and is never
> read by the POST UA tunnel producer.
> After further investigation, it seems that the producer issues the correct
> do_io_read for the required number of bytes, but there is a bug in
> {{SSLNetVConnection::net_read_io}}, where ntodo is calculated before
> acquiring the mutex on the read vio.
> https://github.com/apache/trafficserver/blob/master/iocore/net/SSLNetVConnection.cc#L391
> Instrumenting the code with further debug traces showed that, in the failed
> transactions, ntodo is "0" when computed before acquiring the mutex, whereas
> (s->vio.nbytes - s->vio.ndone) is non-zero after acquiring it. I do not
> fully understand how nbytes on the read vio can differ before the mutex is
> acquired, but moving the ntodo calculation after the mutex acquisition seems
> to have resolved the problem. Note that this is how it is done in the
> corresponding function {{read_from_net}} in {{UnixNetVConnection}}.
> Talking to [~amc] on IRC, it seems that the mutex is needed because
> {{SSLNetVConnection::net_read_io}} can also be triggered by incoming socket
> data before {{UnixNetVConnection::do_io_read}} triggers it, and that can
> mess up nbytes/ndone in the read vio.
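
The ordering problem described above can be illustrated with a small,
self-contained sketch. This is not the actual ATS code: ReadVIO,
set_read_request, ntodo_before_lock, ntodo_after_lock and the in_the_gap
callback are hypothetical stand-ins, and std::mutex is used in place of
whatever locking ATS really does. The callback simulates, deterministically
and on a single thread, a do_io_read from another thread landing between the
ntodo calculation and the mutex acquisition.

```cpp
// Simplified stand-ins for illustration only; these are NOT the real ATS types.
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>

struct ReadVIO {
  std::mutex mutex;    // guards nbytes and ndone
  int64_t nbytes = 0;  // total bytes requested by the consumer
  int64_t ndone  = 0;  // bytes already delivered
};

// Stand-in for a do_io_read() issued elsewhere: updates the VIO under its lock.
inline void set_read_request(ReadVIO &vio, int64_t nbytes)
{
  std::lock_guard<std::mutex> lock(vio.mutex);
  vio.nbytes = nbytes;
  vio.ndone  = 0;
}

// Problematic ordering: ntodo is sampled before the lock is taken, so an
// update that lands in the gap (simulated by in_the_gap) is missed.
inline int64_t ntodo_before_lock(ReadVIO &vio, const std::function<void()> &in_the_gap)
{
  int64_t ntodo = vio.nbytes - vio.ndone; // unsynchronized, possibly stale
  in_the_gap();                           // the "other thread" updates the VIO here
  std::lock_guard<std::mutex> lock(vio.mutex);
  return ntodo;                           // still the stale value
}

// Fixed ordering: take the lock first, then compute ntodo from current state.
inline int64_t ntodo_after_lock(ReadVIO &vio, const std::function<void()> &in_the_gap)
{
  in_the_gap();                           // same update, same point in time
  std::lock_guard<std::mutex> lock(vio.mutex);
  assert(vio.nbytes >= vio.ndone);
  return vio.nbytes - vio.ndone;
}
```

With a 1024-byte request arriving in the gap, e.g.
ntodo_before_lock(vio, [&] { set_read_request(vio, 1024); }), the first
function reports 0 bytes to do, matching the failed transactions, while
ntodo_after_lock reports 1024 for the same sequence of events.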