[
https://issues.apache.org/jira/browse/TS-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614870#comment-14614870
]
Susan Hinrichs edited comment on TS-3714 at 7/6/15 12:02 PM:
-------------------------------------------------------------
I think this change needs evaluation in the 5.3/6.0 code base, and so should be
rolled back for the time being. The handshake handling code changed in 5.2, and
we tracked down some similar issues with that code in TS-3424. I think it would
also be useful to separate out the accept thread fix, which by my reading of the
comments here is not SSL-related.
I spent time this weekend and the end of last week trying to reproduce the
SSL_read over-read described in
http://openssl.6102.n7.nabble.com/non-blocking-SSL-read-API-problem-td37403.html.
Based on the proposed fix, I assume the concern was over-reading only during the
handshake. I created a test client, https://github.com/shinrich/ats-test, which
uses SSL session reuse to create a situation where it is likely that ATS reads
the first packet of the data exchange along with the last packet of the
handshake from the client. I verified that this over-read happens with fair
frequency, based on Wireshark packet captures and on a debug message I inserted
into the 6.0 code base.
The client makes a regular SSL-based HTTPS request, harvests the SSL_SESSION,
and then uses it to launch 100 threads that each perform a session-reuse SSL
handshake and HTTPS request.
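For illustration, here is a minimal sketch of what such a session-reuse client can look like (this is not the actual ats-test code; the host, port, request, and thread structure are assumptions):
{code}
// Hypothetical session-reuse client sketch (not the ats-test code).
// Performs one full handshake, harvests the SSL_SESSION, then launches
// 100 threads that each resume that session and issue an HTTPS request.
#include <openssl/ssl.h>
#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>
#include <thread>
#include <vector>

static int tcp_connect(const char *host, const char *port) {
  addrinfo hints{}, *res = nullptr;
  hints.ai_socktype = SOCK_STREAM;
  if (getaddrinfo(host, port, &hints, &res) != 0) return -1;
  int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
  if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) != 0) { close(fd); fd = -1; }
  freeaddrinfo(res);
  return fd;
}

// One request over a fresh TLS connection, optionally resuming `sess`.
// Returns the (possibly new) session so the caller can reuse it.
static SSL_SESSION *do_request(SSL_CTX *ctx, const char *host, SSL_SESSION *sess) {
  int fd = tcp_connect(host, "443");
  if (fd < 0) return nullptr;
  SSL *ssl = SSL_new(ctx);
  SSL_set_fd(ssl, fd);
  if (sess) SSL_set_session(ssl, sess);               // ask for session reuse
  SSL_SESSION *out = nullptr;
  if (SSL_connect(ssl) == 1) {
    const char req[] = "GET / HTTP/1.1\r\nHost: test.example\r\nConnection: close\r\n\r\n";
    SSL_write(ssl, req, sizeof(req) - 1);
    char buf[4096];
    while (SSL_read(ssl, buf, sizeof(buf)) > 0) {}    // drain the response
    out = SSL_get1_session(ssl);                      // harvest the session
  }
  SSL_shutdown(ssl);
  SSL_free(ssl);
  close(fd);
  return out;
}

int main() {
  SSL_library_init();
  SSL_CTX *ctx = SSL_CTX_new(SSLv23_client_method());
  const char *host = "127.0.0.1";                     // ATS under test (assumed address)
  SSL_SESSION *sess = do_request(ctx, host, nullptr); // full handshake, harvest session
  if (!sess) return 1;
  std::vector<std::thread> workers;
  for (int i = 0; i < 100; ++i)
    workers.emplace_back([&] {                        // 100 concurrent resumed handshakes
      if (SSL_SESSION *s = do_request(ctx, host, sess)) SSL_SESSION_free(s);
    });
  for (auto &t : workers) t.join();
  SSL_SESSION_free(sess);
  SSL_CTX_free(ctx);
  return 0;
}
{code}
The concurrency is what matters here: with 100 resumed handshakes in flight, the client's first application-data record frequently lands in the same read as its final handshake record, which is the over-read situation described above.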
I ran the client in loops for quite a while against open source versions of
5.0.x, 5.3.x, 6.0.0, and my version of master (before the introduction of the
changes on this issue). With earlier versions of the test client I thought I saw
the client stall more often against the 5.0.x version of ATS, but once I fixed
errors in the ats-test client, I no longer saw those stalls.
I don't doubt that there is an error, but it is more subtle than a
straightforward SSL_read over-read on the handshake.
[~sudheerv], how are you catching the problem in production? Via slow logs?
What kind of traffic are you seeing? Mostly POSTs? Large responses? Would it
be possible for you to test with 5.3.0?
> TS seems to stall between ua_begin and ua_first_read on some transactions resulting in high response times.
> -----------------------------------------------------------------------------------------------------------
>
> Key: TS-3714
> URL: https://issues.apache.org/jira/browse/TS-3714
> Project: Traffic Server
> Issue Type: Bug
> Components: Core
> Affects Versions: 5.3.0
> Reporter: Sudheer Vinukonda
> Assignee: Sudheer Vinukonda
> Fix For: 6.0.0
>
>
> An example slow log showing very high *ua_first_read*.
> {code}
> ERROR: [8624075] Slow Request: client_ip: xx.xx.xx.xxx url: http://xxxxxxxxxxxxxxxxxx....
> status: 200 unique id: bytes: 86 fd: 0 client state: 0 server state: 9
> ua_begin: 0.000 ua_first_read: 42.224 ua_read_header_done: 42.224
> cache_open_read_begin: 42.224 cache_open_read_end: 42.224
> dns_lookup_begin: 42.224 dns_lookup_end: 42.224
> server_connect: 42.224 server_first_read: 42.229 server_read_header_done: 42.229
> server_close: 42.229 ua_begin_write: 42.229 ua_close: 42.229 sm_finish: 42.229
> {code}
> Initially, I suspected that it might be caused by browsers connecting early
> before sending any bytes to TS. However, this seems to be happening too
> frequently and with an unrealistically high delay between *ua_begin* and
> *ua_first_read*.
> I suspect it is caused by the code that disables the read temporarily before
> calling the *TXN_START* hook and re-enables it after the API callout. The read
> disable is done via a 0-byte *do_io_read* on the client vc, but the problem is
> that a valid *mbuf* is still passed. Based on what I am seeing, it is possible
> this actually enables the *read_vio* all the way down to *ssl_read_from_net*,
> for instance (if there is a race and bytes have already arrived from the
> client, producing an epoll read event), which would finally disable the read
> since *ntodo* (nbytes) is 0. However, this may result in the epoll event being
> lost until new I/O arrives from the client. I'm trying out further experiments
> to confirm the theory. In most cases, the read buffer already has some bytes
> by the time the HTTP session and HTTP SM are created, which makes it just
> work. But if there is a slight delay in receiving the bytes after the
> connection is made, epoll should still deliver them; due to the way the
> temporary read disable is done, the event may be lost, because ATS uses epoll
> in edge-triggered mode (see the standalone sketch after this description).
> Some history on this issue -
> This issue has been a problem for a long time and affects both http and https
> requests. When this issue was first reported, our router operations team
> eventually closed it, indicating that disabling *accept* threads resolved it
> ([~yzlai] also reported similar observations and conclusions). It's possible
> that disabling accept threads slightly narrows the race condition window, but
> accept threads are very important to overall performance and throughput. I now
> notice that the issue still exists regardless of whether accept threads are
> enabled or disabled, and am testing further to confirm it.
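To make the edge-triggered concern above concrete, here is a minimal standalone sketch (plain epoll over a socketpair, not ATS code; all names are illustrative) showing that once an edge-triggered readiness event is consumed without draining the socket, no further event is reported until the peer sends more bytes:
{code}
// Edge-triggered epoll pitfall sketch (not ATS code).
// With EPOLLET, epoll_wait reports readability on a state change only.
// If the event is consumed but the socket is not drained (analogous to
// the temporary 0-byte read disable), no further event fires until the
// peer writes again.
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int sv[2];
  socketpair(AF_UNIX, SOCK_STREAM, 0, sv);   // sv[0] = "client", sv[1] = "server"
  int ep = epoll_create1(0);
  epoll_event ev{};
  ev.events = EPOLLIN | EPOLLET;             // edge-triggered, as ATS uses epoll
  ev.data.fd = sv[1];
  epoll_ctl(ep, EPOLL_CTL_ADD, sv[1], &ev);

  write(sv[0], "GET", 3);                    // bytes arrive from the client

  epoll_event out[1];
  int n = epoll_wait(ep, out, 1, 100);
  printf("first wait:  %d event(s)\n", n);   // 1: the edge fired

  // Simulate the temporary read disable: the event is consumed, but the
  // 3 bytes are intentionally left unread in the socket buffer.

  n = epoll_wait(ep, out, 1, 100);
  printf("second wait: %d event(s)\n", n);   // 0: no new edge; the data sits
                                             // unnoticed until the peer writes again
  close(sv[0]);
  close(sv[1]);
  close(ep);
  return 0;
}
{code}
Under this model, a request whose bytes arrived while the read was disabled would sit unnoticed until the client sent more data, which would match the long ua_begin to ua_first_read gaps in the slow logs above.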