[jira] [Updated] (UIMA-5567) UIMA-DUCC: Agent should recover its state after restart

Jerry Cwiklik (JIRA) Mon, 18 Sep 2017 08:10:18 -0700

     [ https://issues.apache.org/jira/browse/UIMA-5567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jerry Cwiklik updated UIMA-5567:
--------------------------------
    Description: 
Currently, bouncing (restarting) an agent is not possible. After launching a 
child process, an agent adds an entry to its Process Inventory and uses the 
Process handle to call waitFor() to detect child termination. When an agent 
restarts, it loses all its children and has no means to recover its inventory.

The proposal is to change this behavior to allow agents to bounce and 
subsequently recover their child processes. A bounce may be required, for 
example, to update agent code.

An agent has two ways to recover its child processes, depending on whether 
cgroups are available.

If cgroups are enabled, an agent on startup will read all PIDs from the 
cgroup.procs file. These PIDs reflect the child processes running on a node. 
The agent will create a skeleton inventory entry for each PID and fill in the 
details when the OR state is received, using the PID to find the matching 
process in the OR state. After the inventory is recovered, the timer-based 
inventory update will fetch PIDs from cgroup.procs again and reconcile them 
with the inventory. To detect child process termination, the agent will compare 
the PIDs in its inventory against the PIDs from cgroup.procs. If a PID is in 
the inventory but not present in cgroup.procs, the agent will mark that process 
as Stopped if its deallocate flag is true, or as Failed if the flag is false. 
Any AP process that is no longer running will be marked as Stopped.
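
The cgroup-based recovery and reconciliation described above could be sketched 
roughly as follows. This is only an illustration, not DUCC code: the class, 
the Entry and State types, and the ap and deallocate fields are hypothetical 
names, and the sketch assumes the kernel's cgroup.procs file lists one PID per 
line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; none of these names are taken from the DUCC code base.
class CgroupInventoryRecovery {

    // State names follow the description; DUCC's own types may differ.
    enum State { Running, Stopped, Failed }

    // Skeleton inventory entry: only the PID is known until the OR state arrives.
    static final class Entry {
        final long pid;
        boolean deallocate;   // assumed to be filled in from the OR state
        boolean ap;           // assumed marker for an AP process
        State state = State.Running;
        Entry(long pid) { this.pid = pid; }
    }

    // Parses the content of a cgroup.procs file (one PID per line).
    static Set<Long> parsePids(List<String> lines) {
        Set<Long> pids = new TreeSet<>();
        for (String line : lines) {
            String t = line.trim();
            if (!t.isEmpty()) {
                pids.add(Long.parseLong(t));
            }
        }
        return pids;
    }

    // Reads the PIDs currently listed in the given cgroup.procs file.
    static Set<Long> readPids(Path cgroupProcs) throws IOException {
        return parsePids(Files.readAllLines(cgroupProcs));
    }

    // Startup: create one skeleton entry per PID found in the cgroup.
    static Map<Long, Entry> recover(Set<Long> livePids) {
        Map<Long, Entry> inventory = new LinkedHashMap<>();
        for (long pid : livePids) {
            inventory.put(pid, new Entry(pid));
        }
        return inventory;
    }

    // Timer tick: a PID in the inventory but no longer in the cgroup has terminated.
    static void reconcile(Map<Long, Entry> inventory, Set<Long> livePids) {
        for (Entry e : inventory.values()) {
            if (e.state == State.Running && !livePids.contains(e.pid)) {
                e.state = (e.ap || e.deallocate) ? State.Stopped : State.Failed;
            }
        }
    }
}
```

The timer task would call readPids() and then reconcile() on each tick; 
matching skeleton entries against the OR state is left out of the sketch.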

If cgroups are not enabled, an agent will recover its inventory from the OR 
state. While in this mode, the agent will disable its Rogue Process Detector 
and will not attempt to detect alien processes. The timer-based inventory 
update will fetch PIDs from the OS (using the ps command) and reconcile them 
with the inventory. To detect child process termination, the agent will compare 
the PIDs in its inventory against the PIDs obtained from the OS. If a PID is in 
the inventory but not present in the OS, the agent will mark that process as 
Stopped if its deallocate flag is true, or as Failed if the flag is false. Any 
AP process that is no longer running will be marked as Stopped.
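
For the non-cgroup mode, fetching the live PIDs from the OS might look like the 
sketch below. The class and method names are hypothetical, and the ps 
invocation ("ps -e -o pid=", the POSIX form that prints one PID per line with 
no header) is an assumption; the description does not say which ps options the 
agent would use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; not taken from the DUCC code base.
class PsPidScanner {

    // Extracts PIDs from ps output; lines that are not purely numeric
    // (a header, an error message) are ignored.
    static Set<Long> parsePsOutput(String output) {
        Set<Long> pids = new TreeSet<>();
        for (String line : output.split("\\R")) {
            String t = line.trim();
            if (t.matches("\\d+")) {
                pids.add(Long.parseLong(t));
            }
        }
        return pids;
    }

    // Runs ps and returns the PIDs of all processes currently known to the OS.
    static Set<Long> livePids() throws IOException, InterruptedException {
        Process ps = new ProcessBuilder("ps", "-e", "-o", "pid=")
                .redirectErrorStream(true)
                .start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(ps.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        // Waiting here is for the short-lived ps helper, not for a managed child.
        ps.waitFor();
        return parsePsOutput(out.toString());
    }
}
```

The returned set would then be compared against the inventory recovered from 
the OR state, in the same way as the PIDs read from the cgroup.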


- An agent will no longer call waitFor() on the Process object returned by a 
ProcessBuilder when a child process is launched.

- An agent will continue to drain stdout and stderr of a child process to 
prevent the child (duccling) from hanging and to receive OS errors that may 
occur when exec'ing a process (bad command line, etc.). After duccling calls 
execve(), the child process's stdout and stderr are redirected to /dev/null 
and the agent expects nothing further from these streams.

- A child process will communicate state changes and initialization status to 
an agent via a provided port. The open question is how the port is provided to 
the child. Currently an agent uses -D (or env) to communicate its listener port 
to a child. The port is determined when the agent starts up and can be 
different after the agent is bounced. So we either use a Registry to store the 
agent's port for a child to look up, or insist that an agent has a fixed port. 
If an agent is bounced and that port is not available, what should happen?

- An agent should support a new flag "-Dclean=[true|false]" which on startup 
will force the agent to clean up (terminate) all child processes found in 
cgroups. The code for doing this is already in place and it is the default 
agent procedure on startup. It is still an open question whether this should 
be the default behavior. The same flag should also control what happens on 
agent shutdown. If clean=true, the agent will terminate its children; 
otherwise the child processes will remain running.
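
As a rough illustration of launching a child and draining its stdout and 
stderr without calling waitFor(), consider the sketch below. The class, 
method, and thread names are hypothetical and this is not the agent's actual 
launcher; it only shows the general pattern of one daemon reader thread per 
stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch; not taken from the DUCC code base.
class StreamDrainer {

    // Starts a daemon thread that reads the stream to EOF and appends each
    // line to the sink, so the child never blocks on a full pipe.
    // The sink should be a thread-safe list, since two drainers may share it.
    static Thread drain(InputStream in, String name, List<String> sink) {
        Thread t = new Thread(() -> {
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = r.readLine()) != null) {
                    sink.add(line);
                }
            } catch (IOException e) {
                // The stream was closed; there is nothing more to drain.
            }
        }, name);
        t.setDaemon(true);
        t.start();
        return t;
    }

    // Launches the command and drains both streams. No waitFor() is called:
    // termination is expected to be detected by the inventory reconciliation.
    static Process launch(List<String> command, List<String> out, List<String> err)
            throws IOException {
        Process p = new ProcessBuilder(command).start();
        drain(p.getInputStream(), "stdout-drainer", out);
        drain(p.getErrorStream(), "stderr-drainer", err);
        return p;
    }
}
```

Any exec errors reported by the OS would show up as lines in the err sink; 
once the streams reach EOF the reader threads simply exit.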




> UIMA-DUCC: Agent should recover its state after restart
> -------------------------------------------------------
>
>                 Key: UIMA-5567
>                 URL: https://issues.apache.org/jira/browse/UIMA-5567
>             Project: UIMA
>          Issue Type: Improvement
>          Components: DUCC
>            Reporter: Jerry Cwiklik
>            Assignee: Jerry Cwiklik
>             Fix For: future-DUCC



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
