[
https://issues.apache.org/jira/browse/TS-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425667#comment-15425667
]
Leif Hedstrom commented on TS-4735:
-----------------------------------
Are you by chance going to work on a fix for this? If so, let me know and I can
add you as a contributor so we can assign it to you.
> Possible deadlock on traffic_server startup
> -------------------------------------------
>
> Key: TS-4735
> URL: https://issues.apache.org/jira/browse/TS-4735
> Project: Traffic Server
> Issue Type: Bug
> Components: Core
> Affects Versions: 6.2.0
> Reporter: Shrihari
> Fix For: 7.0.0
>
>
> As part of startup, traffic_server creates two threads (to begin with):
> 1. The main Thread (1) blocks until it is signaled by another thread.
> 2. Thread 2 polls for messages from traffic_manager.
> Thread 2 is waiting for a message from traffic_manager which contains all the
> configuration required for traffic_server to go ahead with initialization.
> Hence, it is critical that the main Thread (1) waits until it gets the
> configuration.
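> To make that hand-off concrete, here is a minimal, self-contained C++ sketch
> of the block-until-signaled pattern described above. It is not the actual
> traffic_server code; the names (config_received, poll_manager_thread) and the
> condition-variable mechanics are illustrative assumptions only.
>
>   // Minimal illustration of the startup hand-off; NOT actual traffic_server code.
>   #include <condition_variable>
>   #include <cstdio>
>   #include <mutex>
>   #include <thread>
>
>   static std::mutex              mtx;
>   static std::condition_variable cv;
>   static bool                    config_received = false;  // set once the config message is processed
>
>   // Stand-in for Thread 2: polls traffic_manager and eventually delivers the configuration.
>   static void poll_manager_thread()
>   {
>       // ... pollLMConnection() / processEventQueue() would run here ...
>       {
>           std::lock_guard<std::mutex> lock(mtx);
>           config_received = true;       // configuration message processed
>       }
>       cv.notify_one();                  // wake the blocked main thread
>   }
>
>   int main()
>   {
>       std::thread poller(poll_manager_thread);
>
>       // Main Thread (1): block until the poller signals that the configuration
>       // from traffic_manager has been processed.
>       {
>           std::unique_lock<std::mutex> lock(mtx);
>           cv.wait(lock, [] { return config_received; });
>       }
>       std::puts("configuration received, continuing initialization");
>
>       poller.join();
>       return 0;
>   }
>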
> Thread 2, which polls for messages from traffic_manager, works like this:
> for (;;) {
>   if (pmgmt->require_lm) {         <--- always True (when using traffic_cop)
>     pmgmt->pollLMConnection();     <--- for (count = 0; count < 10000; count++)
>                                           num = mgmt_read_timeout(...)  <--- blocking call; returns 0 if nothing
>                                                                               was received for 1 second
>                                           if !num: break                <--- break out of the loop and return
>                                                                               from the function
>                                           else: read(fd), add to event queue, continue the loop
>                                                 (back to fetching another message)
>   }
>   pmgmt->processEventQueue();      <--- process the messages received in pollLMConnection()
>   pmgmt->processSignalQueue();
>   mgmt_sleep_sec(pmgmt->timeout);
> }
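>
> To see why this structure can starve the event queue, here is a small,
> self-contained C++ simulation of the loop above. It is not the actual
> ProcessManager code; the message queue, the 1-second timeout helper, and the
> scaled-down producer are assumptions made only to mimic the behaviour
> described in this report.
>
>   // Simplified simulation of pollLMConnection() starving processEventQueue();
>   // NOT actual traffic_server code.
>   #include <chrono>
>   #include <condition_variable>
>   #include <cstdio>
>   #include <mutex>
>   #include <queue>
>   #include <thread>
>
>   using namespace std::chrono_literals;
>
>   static std::mutex              mtx;
>   static std::condition_variable cv;
>   static std::queue<int>         messages;   // stand-in for the traffic_manager socket
>
>   // Stand-in for mgmt_read_timeout(): wait up to 1 second for a message;
>   // return 1 if one is available, 0 on timeout.
>   static int read_with_timeout()
>   {
>       std::unique_lock<std::mutex> lock(mtx);
>       return cv.wait_for(lock, 1s, [] { return !messages.empty(); }) ? 1 : 0;
>   }
>
>   // Mirrors the pseudocode: up to 10,000 iterations, each blocking for up to
>   // 1 second. A message arriving just before every timeout keeps this loop
>   // alive, so the caller does not regain control for a very long time.
>   static void poll_lm_connection()
>   {
>       for (int count = 0; count < 10000; ++count) {
>           if (!read_with_timeout())
>               break;                          // nothing for 1 second: return to caller
>           std::lock_guard<std::mutex> lock(mtx);
>           messages.pop();                     // "read(fd)" and queue the event
>       }
>   }
>
>   int main()
>   {
>       // Producer stand-in for traffic_manager: one notify message roughly
>       // every second (scaled down to 10 messages so the demo terminates).
>       std::thread producer([] {
>           for (int i = 0; i < 10; ++i) {
>               std::this_thread::sleep_for(950ms);
>               {
>                   std::lock_guard<std::mutex> lock(mtx);
>                   messages.push(i);
>               }
>               cv.notify_one();
>           }
>       });
>
>       auto start = std::chrono::steady_clock::now();
>       poll_lm_connection();                   // stays inside while messages keep arriving
>       auto waited = std::chrono::duration_cast<std::chrono::seconds>(
>                         std::chrono::steady_clock::now() - start).count();
>
>       // Only now would processEventQueue() run and the main thread be signaled.
>       std::printf("processEventQueue() reached after ~%lld s\n",
>                   static_cast<long long>(waited));
>
>       producer.join();
>       return 0;
>   }
>
> With one message arriving just under every second, poll_lm_connection() keeps
> looping and the event-processing step is starved the whole time, which is
> exactly the pattern described in the RCA below.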
> RCA:
> There are two problems here:
> 1. Looking at the code above, pollLMConnection() may not return for a very
> long time if it keeps receiving messages. As a result, processEventQueue(),
> which processes the received messages, may not be called. And unless the
> messages are processed, the main Thread (1), which is still blocked, cannot
> be signaled to continue. Hence traffic_server may not complete initialization
> for a very long time.
> 2. The second problem is why traffic_server receives so many messages at
> boot-up in the first place. The problem lies in the configuration. In 6.2.x,
> we replaced 'proxy.process.ssl.total_success_handshake_count' with
> 'proxy.process.ssl.total_success_handshake_count_in'.
> In order to provide backwards compatibility, we defined the old stat in
> stats.config.xml. The caveat here is that, since this stat is defined in
> stats.config.xml, traffic_manager assumes the responsibility of updating it.
> According to the code:
> if (i_am_not_owner_of(stat)) : send traffic_server a notify message.
> Ideally, this code should not be triggered, because traffic_manager does own
> the stat. However, ownership in the code is determined solely by the stat's
> string name: if the name contains 'process', it is owned by traffic_server.
> This leads to an interesting scenario where traffic_manager keeps updating
> its own stat and sends unnecessary events to traffic_server (see the sketch
> below). These updates happen every 1 second (thanks to James for helping me
> understand this period), which is the same as our timeout in traffic_server.
> Due to 'Problem 1', this can prevent traffic_server from processing any
> messages for up to 10,000 seconds (10,000 iterations at roughly 1 second
> each, i.e. close to three hours): just imagine the case where each message
> arrives right before the 1-second timeout expires.
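> To illustrate that heuristic, here is a small, hypothetical C++ sketch. The
> function name owned_by_traffic_server() and its structure are assumptions
> made purely for illustration; the real ownership check lives in
> traffic_manager's stats handling and is not reproduced here.
>
>   // Hypothetical illustration of the name-based ownership check described
>   // above; the real code in traffic_manager differs in naming and detail.
>   #include <cstdio>
>   #include <string>
>
>   // Ownership decided purely from the stat's name: anything containing
>   // "process" is treated as a traffic_server-owned stat.
>   static bool owned_by_traffic_server(const std::string &name)
>   {
>       return name.find("process") != std::string::npos;
>   }
>
>   int main()
>   {
>       // Defined in stats.config.xml for backwards compatibility, so it is in
>       // fact updated by traffic_manager -- yet the heuristic says otherwise.
>       const std::string stat = "proxy.process.ssl.total_success_handshake_count";
>
>       if (owned_by_traffic_server(stat)) {
>           // traffic_manager concludes it is not the owner, so every update
>           // (once per second) triggers a notify message to traffic_server.
>           std::printf("would send notify for %s\n", stat.c_str());
>       }
>       return 0;
>   }
>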
> I saw this happen 100% of the time on a VM but 0% of the time on a physical
> box. I don't have any other results as of now, though.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)