GitHub user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/18320#discussion_r123700234
--- Diff: R/pkg/inst/worker/daemon.R ---
@@ -30,8 +30,51 @@ port <- as.integer(Sys.getenv("SPARKR_WORKER_PORT"))
inputCon <- socketConnection(
port = port, open = "rb", blocking = TRUE, timeout = connectionTimeout)
+# Waits indefinitely for a socket connection by default.
+selectTimeout <- NULL
+
while (TRUE) {
- ready <- socketSelect(list(inputCon))
+ ready <- socketSelect(list(inputCon), timeout = selectTimeout)
+
+ # Note that the children should be terminated in the parent. If each child terminates
+ # itself, it appears that the resource is not released properly, that causes an unexpected
+ # termination of this daemon due to, for example, running out of file descriptors
+ # (see SPARK-21093). Therefore, the current implementation tries to retrieve children
+ # that are exited (but not terminated) and then sends a kill signal to terminate them properly
+ # in the parent.
+ #
+ # There are two paths that it attempts to send a signal to terminate the children in the parent.
+ #
+ # 1. Every second if any socket connection is not available and if there are child workers
+ # running.
+ # 2. Right after a socket connection is available.
+ #
+ # In other words, the parent attempts to send the signal to the children every second if
+ # any worker is running or right before launching other worker children from the following
+ # new socket connection.
+
+ # Only the process IDs of exited children are returned and the termination is attempted below.
+ children <- parallel:::selectChildren(timeout = 0)
+ if (is.integer(children)) {
+ # If it is PIDs, there are workers exited but not terminated. Attempts to terminate them
+ # by setting SIGUSR1.
+ lapply(children, function(child) {
--- End diff --
With the change above, it printed:
```
[1] "Wait for 4 seconds to test the last child ..."
[1] "child PID: 86866 and parent will kill given: -1"
[1] "child PID: 86865 and parent will kill given: -1"
[1] "child PID: 86864 and sent a data: arbitrary"
[1] "child PID: 86863 and sent a PID: 86863"
[1] "Wait for 7 seconds more to test the last child ..."
[1] "child PID: 86864 and sent a data: 123"
```
It looks correct.