[ https://issues.apache.org/jira/browse/TS-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425785#comment-15425785 ]
Shrihari commented on TS-4735:
------------------------------

Leif, yes. I am working on a fix for this right now. I am just going over the workflow.

> Possible deadlock on traffic_server startup
> -------------------------------------------
>
>                 Key: TS-4735
>                 URL: https://issues.apache.org/jira/browse/TS-4735
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 6.2.0
>            Reporter: Shrihari
>             Fix For: 7.0.0
>
>
> As part of startup, traffic_server creates two threads (to begin with):
> 1. The main thread (Thread 1) blocks until it is signaled by another thread.
> 2. Thread 2 polls for messages from traffic_manager.
> Thread 1 is waiting for a message from traffic_manager that contains all the
> configuration required for it to go ahead with initialization. Hence, it is
> critical that the main thread (Thread 1) wait until it gets the configuration.
> Thread 2, which polls for messages from traffic_manager, works like this:
>
> for (;;) {
>   if (pmgmt->require_lm) {        <--- always true (when using traffic_cop)
>     pmgmt->pollLMConnection();    <--- for (count = 0; count < 10000; count++)
>                                          num = mgmt_read_timeout(...)  <--- blocking call; returns 0 if nothing
>                                                                             was received for 1 second
>                                          if (!num): break              <--- break out of the loop and return
>                                                                             from the function
>                                          else: read(fd), add to event queue, continue the loop
>                                                (back to fetching another message)
>   }
>   pmgmt->processEventQueue();     <--- process the messages received in pollLMConnection()
>   pmgmt->processSignalQueue();
>   mgmt_sleep_sec(pmgmt->timeout);
> }
>
> RCA:
> There are two problems here:
> 1. Looking at the code above, pollLMConnection() might not return for a very
> long time if it keeps receiving messages. As a result, we may not call
> processEventQueue(), which processes the received messages. And unless we
> process the messages, we cannot signal the main thread (Thread 1), which is
> still blocked.
> Hence we see the issue where traffic_server won't complete initialization
> for a very long time.
> 2. The second problem is: why is traffic_server receiving so many messages
> at boot-up? The problem lies in the configuration. In 6.2.x, we replaced
> 'proxy.process.ssl.total_success_handshake_count' with
> 'proxy.process.ssl.total_success_handshake_count_in'.
> In order to provide backwards compatibility, we defined the old stat in
> stats.config.xml. The caveat here is that, since this stat is defined in
> stats.config.xml, traffic_manager assumes the responsibility of updating it.
> According to the code:
>
> if (i_am_not_owner_of(stat)): send traffic_server a notify message
>
> Ideally, this code should not be triggered, because traffic_manager does own
> the stat. However, ownership in the code is determined solely by the stat's
> string name: if the name contains 'process', it is owned by traffic_server.
> This leads to an interesting scenario where traffic_manager keeps updating
> its own stat and sends unnecessary events to traffic_server. These updates
> happen every 1 second (thanks to James for helping me understand this
> period), which is the same as our timeout in traffic_server. Due to
> Problem 1, this can prevent traffic_server from processing any messages for
> up to 10,000 seconds! (Just imagine the case where each message is received
> just before the 1-second timeout expires.)
> I saw this happen 100% of the time on a VM but 0% of the time on a physical
> box. I don't have any other results as of now, though.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)