GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/18465

    [SPARK-21093][R] Terminate R's worker processes in the parent of R's daemon 
to prevent a leak 

    ## What changes were proposed in this pull request?
    
    This is a retry of #18320, which was reverted due to unexpected test 
failures with a -10 error code.
    
    These failures could not be reproduced on macOS, CentOS or Ubuntu, only on 
Jenkins, so tests were run to verify the cause before the previous attempt was 
reverted.
    
    The new approach was tested in https://github.com/apache/spark/pull/18463.
    
    Test results:
    
    - With the suspicious part of the change from the previous attempt 
(https://github.com/apache/spark/pull/18463/commits/466325d3fd353668583f3bde38ae490d9db0b189)
      
      Tests ran 4 times; 2 passed and 2 failed.
    
    - Without the suspicious part of the change from the previous attempt 
(https://github.com/apache/spark/pull/18463/commits/466325d3fd353668583f3bde38ae490d9db0b189)
    
      Tests ran 5 times and all passed.
    
    - With the new approach 
(https://github.com/apache/spark/pull/18463/commits/0a7589c09f53dfc2094497d8d3e59d6407569417)
    
      Tests ran 5 times and all passed.
    
    
    The cause appears to be the change below (see 
https://github.com/apache/spark/pull/18463/commits/466325d3fd353668583f3bde38ae490d9db0b189):
    
    ```diff
    + exitCode <- 1
    ...
    +   data <- parallel:::readChild(child)
    +   if (is.raw(data)) {
    +     if (unserialize(data) == exitCode) {
          ...
    +     }
    +   }
    
    ...
    
    - parallel:::mcexit(0L)
    + parallel:::mcexit(0L, send = exitCode)
    ```
    
    
    I see two possibilities:
    
     - `parallel:::mcexit(.. , send = exitCode)`
       
       https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mcfork.html
    
       > It sends send to the master (unless NULL) and then shuts down the 
child process.
    
       However, it seems possible that the parent attempts to terminate the 
child right after receiving our custom exit code, so the child gets killed 
between the "send" and the "shut down" steps and fails to exit properly.
    
     - A bug between `parallel:::mcexit(..., send = ...)` and 
`parallel:::readChild`.
    
    
    To resolve this, this PR simply avoids both possibilities with the new 
approach here 
(https://github.com/HyukjinKwon/spark/commit/9ff89a7859cb9f427fc774f33c3521c7d962b723).
    
    https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/mcfork.html
    
    > `readChild` and `readChildren` return a raw vector with a "pid" attribute 
if data were available, an integer vector of length one with the process ID if 
a child terminated or `NULL` if the child no longer exists (no children at all 
for `readChildren`).
    
    `readChild` returns "an integer vector of length one with the process ID if 
a child terminated", so we can check whether the return value is an `integer` 
equal to the selected process ID. I believe this ensures that the children have 
exited.
    
    If the child sends any data to the parent, it arrives as raw bytes and is 
discarded; the loop then reads again and checks for an `integer` on the next 
iteration.
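    The check described above can be sketched as follows. This is a minimal 
standalone sketch against base R's `parallel` internals on a Unix-like system, 
not the actual `daemon.R` code; the variable names are illustrative:
    
    ```r
    # Fork one child; mcfork returns a process object in the parent and a
    # "masterProcess" object in the child.
    p <- parallel:::mcfork()
    if (inherits(p, "masterProcess")) {
      # Child: do some work, then shut down; no custom exit code is sent.
      parallel:::mcexit(0L)
    }
    
    # Parent: keep reading from the child. Raw vectors are ordinary data and
    # are discarded; an integer equal to the child's PID means the child has
    # terminated, so we can safely stop tracking it. NULL means the child no
    # longer exists.
    repeat {
      data <- parallel:::readChild(p)
      if (is.null(data) || (is.integer(data) && data == p$pid)) break
    }
    ```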
    
    ## How was this patch tested?
    
    Manual tests and Jenkins tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-21093-retry-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18465.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18465
    
----
commit d5be516190256889e1b1e022af8a778935131f72
Author: hyukjinkwon <[email protected]>
Date:   2017-06-25T18:05:57Z

    [SPARK-21093][R] Terminate R's worker processes in the parent of R's daemon 
to prevent a leak
    
    ## What changes were proposed in this pull request?
    
    `mcfork` in R appears to open a pipe ahead of time, but the existing logic 
does not properly close it when workers are forked rapidly. This leads to 
failures in further forking once the limit on the number of open files is 
reached.
    
    This rapid forking happens particularly with `gapply`/`gapplyCollect`. For 
unknown reasons, it happens more easily on CentOS, but it could be reproduced 
on Mac too.
    
    All the details are described in 
https://issues.apache.org/jira/browse/SPARK-21093
    
    This PR proposes simply to terminate R's worker processes in the parent of 
R's daemon to prevent a leak.
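    
    The idea of cleaning up workers from the parent can be sketched as follows. 
This is a hypothetical sketch using base R's `parallel` internals, not the 
actual `daemon.R` code:
    
    ```r
    # Parent (daemon) forks a worker per task.
    p <- parallel:::mcfork()
    if (inherits(p, "masterProcess")) {
      # Worker: run the task here, then exit.
      parallel:::mcexit(0L)
    }
    
    # Parent: once the worker is done, terminate it and drain its pipe so the
    # file descriptors opened by mcfork are released promptly instead of
    # leaking across many forks.
    parallel:::mckill(p, tools::SIGTERM)
    parallel:::readChild(p)
    ```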
    
    ## How was this patch tested?
    
    I ran the code below on both CentOS and Mac with that configuration 
disabled and enabled.
    
    ```r
    df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
    collect(gapply(df, "a", function(key, x) { x }, schema(df)))
    collect(gapply(df, "a", function(key, x) { x }, schema(df)))
    ...  # 30 times
    ```
    
    Also, the R tests now pass on CentOS, as shown below:
    
    ```
    SparkSQL functions: Spark package found in SPARK_HOME: .../spark
    
..............................................................................................................................................................
    
..............................................................................................................................................................
    
..............................................................................................................................................................
    
..............................................................................................................................................................
    
..............................................................................................................................................................
    
....................................................................................................................................
    ```
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #18320 from HyukjinKwon/SPARK-21093.

commit 9ff89a7859cb9f427fc774f33c3521c7d962b723
Author: hyukjinkwon <[email protected]>
Date:   2017-06-29T08:43:30Z

    New approach

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
