Sudheer Vinukonda created TS-3794:
-------------------------------------

             Summary: Enhance post SSL handshake read race condition for SPDY 
and H2 scenario
                 Key: TS-3794
                 URL: https://issues.apache.org/jira/browse/TS-3794
             Project: Traffic Server
          Issue Type: Bug
          Components: Core
    Affects Versions: 5.3.0
            Reporter: Sudheer Vinukonda
            Assignee: Sudheer Vinukonda
             Fix For: 6.1.0


An example slow log showing very high *ua_first_read*.

{code}
ERROR: [8624075] Slow Request: client_ip: xx.xx.xx.xxx url: 
http://xxxxxxxxxxxxxxxxxx.... status: 200 unique id: bytes: 86 fd: 0 client 
state: 0 serv
er state: 9 ua_begin: 0.000 ua_first_read: 42.224 ua_read_header_done: 42.224 
cache_open_rea
d_begin: 42.224 cache_open_read_end: 42.224 dns_lookup_begin: 42.224 
dns_lookup_end: 42.224
server_connect: 42.224 server_first_read: 42.229 server_read_header_done: 
42.229 server_clos
e: 42.229 ua_begin_write: 42.229 ua_close: 42.229 sm_finish: 42.229
{code}

Initially, I suspected that it might be caused by browser's connecting early 
before sending any bytes to TS. However, this seems to be happening too 
frequently and with unrealistically high delay between *ua_begin* and 
*ua_first_read*.

I suspect it's caused due to the code that disables the read temporarily before 
calling *TXN_START* hook and re-enables it after the API call out. The read 
disable is done via a 0-byte *do_io_read* on the client vc, but, the problem is 
that a valid *mbuf* is still passed. Based on what I am seeing, it's possible 
this results in actually enabling the *read_vio* all the way uptil 
*ssl_read_from_net* for instance (if there's a race condition and there were 
bytes already from the client resulting in an epoll read event), which would 
finally disable the read since, the *ntodo* (nbytes) is 0. However, this may 
result in the epoll event being lost until a new i/o happens from the client. 
I'm trying out further experiments to confirm the theory. In most cases, the 
read buffer already has some bytes by the time the http session and http sm is 
created, which makes it just work. But, if there's a slight delay in the 
receiving bytes after making a connection, the epoll mechanism should read it, 
but, due to the way the temporary read disable is being done, the event may be 
lost (this is coz, ATS uses the epoll edge triggered mode).

Some history on this issue - 

This issue has been a problem for a long time and affects both http and https 
requests. When this issue was first reported, our router operations team 
eventually closed it indicating that disabling *accept* threads resolved it 
([~yzlai] also reported similar observations and conclusions). It's possible 
that the race condition window may be slightly reduced by disabling accept 
threads, but, to the overall performance and through put, accept threads are 
very important. I now notice that the issue still exists (regardless of whether 
or not accept threads are enabled/disabled) and am testing further to confirm 
the issue.










--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to