[
https://issues.apache.org/jira/browse/TS-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Victor updated TS-3104:
-----------------------
Attachment: ts-0023-cop-reinit-mgr-api-on-failure.patch
ts-0022-fix-lockfile-killgroup.patch
Patches for the issues described below.
> traffic_cop can't restart traffic_manager properly
> --------------------------------------------------
>
> Key: TS-3104
> URL: https://issues.apache.org/jira/browse/TS-3104
> Project: Traffic Server
> Issue Type: Bug
> Components: Cop
> Reporter: Victor
> Attachments: ts-0022-fix-lockfile-killgroup.patch,
> ts-0023-cop-reinit-mgr-api-on-failure.patch
>
>
> In some cases traffic_cop cannot restart traffic_manager properly. We ran into
> these issues at "Ashmanov and partners" (http://en.ashmanov.com/). In my
> opinion, two places in the code need correction:
> 1) The logic that decides whether to kill a single process or its process group.
> 2) The main traffic_cop loop: it doesn't reinitialize the manager API after a
> failure, which leads to endless attempts to connect to the manager using
> socket id == -1.
> I have prepared patches for both issues. Please take a look at them
> and let me know your thoughts.
> diff --git lib/ts/lockfile.cc lib/ts/lockfile.cc
> index f6e9587..dbd7394 100644
> --- lib/ts/lockfile.cc
> +++ lib/ts/lockfile.cc
> @@ -241,6 +241,7 @@ Lockfile::KillGroup(int sig, int initial_sig, const char *pname)
> int err;
> pid_t pid;
> pid_t holding_pid;
> + pid_t self = getpid();
>
> err = Open(&holding_pid);
> if (err == 1) // success getting the lock file
> @@ -252,12 +253,20 @@ Lockfile::KillGroup(int sig, int initial_sig, const char *pname)
> pid = getpgid(holding_pid);
> } while ((pid < 0) && (errno == EINTR));
>
> - if ((pid < 0) || (pid == getpid()))
> + if ((pid < 0) || (pid == self)) {
> + // Error getting process group,
> + // or we are the group's owner.
> + // Let's kill just holding_pid
> pid = holding_pid;
> -
> - if (pid != 0) {
> + }
> + else if (pid != self) {
> + // We managed to get holding_pid's process group
> + // and this group is not ours.
> // This way, we kill the process_group:
> pid = -pid;
> + }
> +
> + if (pid != 0) {
> // In order to get core files from each process, please
> // set your core_pattern appropriately.
> lockfile_kill_internal(holding_pid, initial_sig, pid, pname, sig);
> diff --git cop/TrafficCop.cc cop/TrafficCop.cc
> index 307270e..56bc6d2 100644
> --- cop/TrafficCop.cc
> +++ cop/TrafficCop.cc
> @@ -59,6 +59,7 @@ static const char COP_TRACE_FILE[] = "/tmp/traffic_cop.trace";
>
> #define COP_FATAL LOG_ALERT
> #define COP_WARNING LOG_ERR
> +#define COP_INFO LOG_INFO
> #define COP_DEBUG LOG_DEBUG
>
> Diags * g_diags; // link time dependency
> @@ -131,6 +132,9 @@ static int child_pid = 0;
> static int child_status = 0;
> static int sem_id = 11452;
>
> +// manager API is initialized
> +static bool mgmt_init = false;
> +
> AppVersionInfo appVersionInfo;
>
> static char const localhost[] = "127.0.0.1";
> @@ -1142,6 +1146,7 @@ test_mgmt_cli_port()
>
> if (TSRecordGetString("proxy.config.manager_binary", &val) != TS_ERR_OKAY) {
> cop_log(COP_WARNING, "(cli test) unable to retrieve manager_binary\n");
> + mgmt_init = false;
> ret = -1;
> } else {
> if (strcmp(val, manager_binary) != 0) {
> @@ -1544,7 +1549,6 @@ check_no_run()
> static void*
> check(void *arg)
> {
> - bool mgmt_init = false;
> cop_log_trace("Entering check()\n");
>
> for (;;) {
> @@ -1593,6 +1597,7 @@ check(void *arg)
>
> // We do this after the first round of checks, since the first "check" will spawn traffic_manager
> if (!mgmt_init) {
> + cop_log(COP_INFO, "Initializing manager API\n");
> TSInit(Layout::get()->runtimedir, static_cast<TSInitOptionT>(TS_MGMT_OPT_NO_EVENTS | TS_MGMT_OPT_NO_SOCK_TESTS));
> mgmt_init = true;
> }
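
For readers who want to reason about the lockfile change without the rest of Lockfile::KillGroup, the target selection from the patch can be sketched as a standalone function. This is only an illustration under stated assumptions: choose_kill_target and its parameter names are hypothetical and do not exist in Traffic Server. The function mirrors the patched branches as quoted above: fall back to the holding process alone when the group lookup failed or the group id equals our own pid, otherwise return the negated group id so that kill(2) signals the whole group.

```cpp
#include <sys/types.h>

// Hypothetical helper, not part of the patch or of Traffic Server.
//   pgid        - result of getpgid(holding_pid); negative if the lookup failed
//   holding_pid - pid recorded in the lock file
//   self        - result of getpid() in the calling process
// Returns the value to hand to kill(2): a positive pid signals one process,
// a negative value signals the process group with that absolute id.
// A return value of 0 corresponds to the patch's "pid != 0" guard: skip the kill.
pid_t choose_kill_target(pid_t pgid, pid_t holding_pid, pid_t self)
{
  if (pgid < 0 || pgid == self) {
    // Group lookup failed, or the group id matches our own pid:
    // signal only the process that holds the lock.
    return holding_pid;
  }
  // The holder's group was found and differs from our pid: signal the group.
  return -pgid;
}
```

As far as the quoted diff shows, the pre-patch code negated the value in both branches, so the fallback to holding_pid was also sent as a group signal; the sketch keeps the two cases separate, as the patch does.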