[ https://issues.apache.org/jira/browse/TS-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
James Peach updated TS-4735:
----------------------------
Description:
As part of startup, traffic_server creates two threads (to begin with):
1. The main thread (Thread 1) blocks until it is signaled by another thread.
2. Thread 2 polls for messages from traffic_manager.
The main thread is waiting for a message from traffic_manager that contains all
the configuration required to go ahead with initialization. Hence, it is
critical that the main thread (Thread 1) wait until it gets that configuration.
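For illustration only, here is a minimal, self-contained sketch of this blocking pattern (this is not the actual traffic_server code; the names config_received, config_cv and poll_manager are hypothetical stand-ins):
{noformat}
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex              config_mutex;
std::condition_variable config_cv;
bool                    config_received = false; // set once the manager's config message is processed

// Stand-in for Thread 2: wait for the manager's message, process it, then wake the main thread.
void poll_manager()
{
  std::this_thread::sleep_for(std::chrono::seconds(1)); // pretend we polled traffic_manager
  {
    std::lock_guard<std::mutex> lock(config_mutex);
    config_received = true;                             // the equivalent of processing the config event
  }
  config_cv.notify_one();                               // signal the main thread to continue
}

int main()
{
  std::thread poller(poll_manager);

  // Main thread (Thread 1): block until the configuration has been received and processed.
  std::unique_lock<std::mutex> lock(config_mutex);
  config_cv.wait(lock, [] { return config_received; });

  std::cout << "configuration received, continuing initialization\n";
  poller.join();
  return 0;
}
{noformat}
If Thread 2 never gets around to processing the message and signaling, the main thread stays blocked indefinitely, which is the failure mode described below.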
Thread 2, which polls for messages from traffic_manager, works roughly like this:
{noformat}
for (;;) {
  if (pmgmt->require_lm) {        <--- Always true (when using traffic_cop)
    pmgmt->pollLMConnection();    <--- for (count = 0; count < 10000; count++)
                                       |   num = mgmt_read_timeout(...)  <--- blocking call; returns 0 if
                                       |                                      nothing was received for 1 second
                                       |   if !num: break                <--- break out of the loop and
                                       |                                      return from the function
                                       |   else: read(fd), add_to_event_queue,
                                       |         continue the loop, back to fetching another message
  }
  pmgmt->processEventQueue();     <--- processes the messages received in pollLMConnection()
  pmgmt->processSignalQueue();
  mgmt_sleep_sec(pmgmt->timeout);
}
{noformat}
RCA:
There are two problems here:
1. Looking at the code above, pollLMConnection() might not return for a very
long time if it keeps receiving messages. As a result, we may not call
processEventQueue(), which processes the received messages. And unless we
process the messages, we cannot signal the main thread (Thread 1), which is
still blocked, to continue. Hence traffic_server may not complete its
initialization for a very long time.
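To make the worst case concrete, here is a small stand-alone simulation of the inner loop (hypothetical names; the 10,000-iteration bound and the 1-second read timeout are taken from the pseudocode above):
{noformat}
#include <iostream>

// Simulated mgmt_read_timeout(): returns 1 ("a message arrived") just before the
// 1-second timeout expires on every iteration, which is the worst case described here.
static int mgmt_read_timeout_simulated()
{
  return 1;
}

int main()
{
  const int max_iterations  = 10000; // inner-loop bound in pollLMConnection()
  const int timeout_seconds = 1;     // per-read timeout
  int       seconds_blocked = 0;

  // Worst case: every read returns a message, so we never break out early, and
  // processEventQueue() (and therefore the signal to the main thread) is delayed.
  for (int count = 0; count < max_iterations; count++) {
    int num = mgmt_read_timeout_simulated();
    if (!num) {
      break;                            // nothing received for 1 second: return to the outer loop
    }
    seconds_blocked += timeout_seconds; // a message arrived just before the timeout expired
  }

  std::cout << "processEventQueue() delayed by up to " << seconds_blocked
            << " seconds (~" << seconds_blocked / 3600.0 << " hours)\n";
  return 0;
}
{noformat}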
2. The second problem is: why is traffic_server receiving so many messages at
boot-up? The problem lies in the configuration. In 6.2.x, we replaced
'proxy.process.ssl.total_success_handshake_count' with
'proxy.process.ssl.total_success_handshake_count_in'.
In order to provide backwards compatibility, we defined the old stat in
stats.config.xml. The caveat is that, since this stat is defined in
stats.config.xml, traffic_manager assumes responsibility for updating it.
According to the code:
{noformat}
if (i_am_not_owner_of(stat)) : send traffic_server a notify message
{noformat}
Ideally, this code should not be triggered, because traffic_manager does own
the stat. However, ownership in the code is determined solely by the stat's
string name: if the name contains 'process', it is treated as owned by
traffic_server. This leads to an interesting scenario where traffic_manager
keeps updating its own stat and sends unnecessary notify events to
traffic_server. These updates happen every 1 second (thanks James for helping
me understand this period), which is the same as our timeout in
traffic_server. Combined with Problem 1, this can prevent traffic_server from
processing any messages for up to 10,000 seconds! (Just imagine the case where
each message is received just before the 1-second timeout expires.) A rough
sketch of the name-based ownership check follows.
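This is only an illustrative rendering of the rule described above, not the actual traffic_manager code; owned_by_traffic_server is a hypothetical helper:
{noformat}
#include <iostream>
#include <string>

// Hypothetical rendering of the ownership rule: a stat whose name contains
// "process" is treated as owned by traffic_server.
static bool owned_by_traffic_server(const std::string &stat_name)
{
  return stat_name.find("process") != std::string::npos;
}

int main()
{
  // The backwards-compatibility stat defined in stats.config.xml, which
  // traffic_manager itself is responsible for updating.
  const std::string stat = "proxy.process.ssl.total_success_handshake_count";

  if (owned_by_traffic_server(stat)) {
    // Because the name contains "process", traffic_manager concludes it is not
    // the owner and sends traffic_server a notify message on every update
    // (every 1 second), even though traffic_manager actually owns this stat.
    std::cout << stat << " misclassified as owned by traffic_server\n";
  }
  return 0;
}
{noformat}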
I saw this happen 100% of the time on a VM, but never on a physical box. I
don't have any other results as of now, though.
> Possible deadlock on traffic_server startup
> -------------------------------------------
>
> Key: TS-4735
> URL: https://issues.apache.org/jira/browse/TS-4735
> Project: Traffic Server
> Issue Type: Bug
> Components: Core
> Affects Versions: 6.2.0
> Reporter: Shrihari
> Assignee: Shrihari
> Fix For: 7.0.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)