Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18320#discussion_r123715868
  
    --- Diff: R/pkg/inst/worker/daemon.R ---
    @@ -30,8 +30,55 @@ port <- as.integer(Sys.getenv("SPARKR_WORKER_PORT"))
     inputCon <- socketConnection(
         port = port, open = "rb", blocking = TRUE, timeout = connectionTimeout)
     
    +# Waits indefinitely for a socket connecion by default.
    +selectTimeout <- NULL
    +
    +# Exit code that children send to the parent to indicate they exited.
    +exitCode <- 1
    +
     while (TRUE) {
    -  ready <- socketSelect(list(inputCon))
    +  ready <- socketSelect(list(inputCon), timeout = selectTimeout)
    +
    +  # Note that the children should be terminated in the parent. If each 
child terminates
    +  # itself, it appears that the resource is not released properly, that 
causes an unexpected
    +  # termination of this daemon due to, for example, running out of file 
descriptors
    +  # (see SPARK-21093). Therefore, the current implementation tries to 
retrieve children
    +  # that are exited (but not terminated) and then sends a kill signal to 
terminate them properly
    +  # in the parent.
    +  #
    +  # There are two paths that it attempts to send a signal to terminate the 
children in the parent.
    +  #
    +  #   1. Every second if any socket connection is not available and if 
there are child workers
    +  #     running.
    +  #   2. Right after a socket connection is available.
    +  #
    +  # In other words, the parent attempts to send the signal to the children 
every second if
    +  # any worker is running or right before launching other worker children 
from the following
    +  # new socket connection.
    +
    +  # Only the process IDs of children sent data to the parent are returned 
below. The children
    +  # send a custom exit code to the parent after being exited and the 
parent tries
    +  # to terminate them only if they sent the exit code.
    +  children <- parallel:::selectChildren(timeout = 0)
    +
    +  if (is.integer(children)) {
    +    lapply(children, function(child) {
    +      # This data should be raw bytes if any data was sent from this child.
    +      # Otherwise, this returns the PID.
    +      data <- parallel:::readChild(child)
    +      if (is.raw(data)) {
    +        # This checks if the data from this child is the exit code that 
indicates an exited child.
    +        if (unserialize(data) == exitCode) {
    +          # If so, we terminate this child.
    +          tools::pskill(child, tools::SIGUSR1)
    +        }
    +      }
    +    })
    +  } else if (is.null(children)) {
    +    # If it is NULL, there are no such children. Waits indefinitely for a 
socket connecion.
    +    selectTimeout <- NULL
    --- End diff --
    
    It becomes `NULL` if there are no children. It returns `TRUE` if there are 
children here. So, we always turn back to wait indefinitely only if there are 
no children at all.
    
    https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/children.html
    
    > TRUE is the timeout was reached, FALSE if an error occurred (e.g., if the 
master process was interrupted) or an integer vector of process IDs with 
children that have data available, or NULL if there are no children.
    
    Would this answer your question (or did I maybe misunderstand) ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to