HyukjinKwon opened a new pull request #28907:
URL: https://github.com/apache/spark/pull/28907


   ### What changes were proposed in this pull request?
   
   This PR proposes to ignore S4 generic methods under the SparkR namespace
   during closure cleaning in order to support R 4.0.0+.
   
   Currently, code that runs native R code, such as the following, fails with
   R 4.0.0:
   
   ```r
   df <- createDataFrame(lapply(seq(100), function (e) list(value=e)))
   count(dapply(df, function(x) as.data.frame(x[x$value < 50,]), schema(df)))
   ```
   
   ```
   org.apache.spark.SparkException: R unexpectedly exited.
   R worker produced errors: Error in lapply(part, FUN) : attempt to bind a variable to R_UnboundValue
   ```
   
   The root cause seems to be related to an S4 generic method being manually
   included in the closure's environment via `SparkR:::cleanClosure`. For
   example, when an RRDD is created via `createDataFrame`, which calls `lapply`
   to convert the data, `lapply` itself:
   
   
https://github.com/apache/spark/blob/f53d8c63e80172295e2fbc805c0c391bdececcaa/R/pkg/R/RDD.R#L484
   
   is added to the environment of the cleaned closure, because it is not part
   of an exposed namespace. However, this is broken in R 4.0.0+ for an unknown
   reason, and fails with an error message such as "attempt to bind a variable
   to R_UnboundValue".
   
   In fact, we don't need to add `lapply` to the environment of the closure,
   because it is not supposed to be called on the worker side. As far as I
   understand, there are no private generic methods in SparkR that are supposed
   to be called on the worker side at all.
   
   Therefore, this PR takes a simpler path and works around the problem by
   explicitly excluding the S4 generic methods under the SparkR namespace, so
   that SparkR supports R 4.0.0.
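
The exclusion idea can be illustrated outside of SparkR. The following is only a minimal sketch and is not the code changed by this PR: `myApply`, `isS4Generic` and `captureVars` are hypothetical names made up for illustration, and the sketch assumes that an S4 generic can be recognized with `isS4()` together with `is(obj, "standardGeneric")` from the `methods` package.

```r
library(methods)

# Hypothetical generic standing in for a private SparkR generic such as `lapply`.
setGeneric("myApply", function(X, FUN) standardGeneric("myApply"))

# Hypothetical helper: TRUE when `obj` is an S4 standard generic function.
isS4Generic <- function(obj) {
  is.function(obj) && isS4(obj) && is(obj, "standardGeneric")
}

# Hypothetical stand-in for closure cleaning: copy the named variables from
# `from` into a fresh environment, skipping S4 generics.
captureVars <- function(names, from, to = new.env(parent = emptyenv())) {
  for (name in names) {
    if (!exists(name, envir = from, inherits = FALSE)) next
    obj <- get(name, envir = from, inherits = FALSE)
    if (isS4Generic(obj)) next  # do not ship the generic to the worker
    assign(name, obj, envir = to)
  }
  to
}

src <- new.env()
assign("helper", function(x) x + 1, envir = src)
assign("myApply", myApply, envir = src)

captured <- captureVars(c("helper", "myApply"), src)
ls(captured)  # only "helper" is captured
```

Ordinary functions are still captured; only the generic is skipped, which mirrors the reasoning above that such generics are not needed on the worker side.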
   
   ### Why are the changes needed?
   
   To support R 4.0.0+ with SparkR and to unblock the releases on CRAN. CRAN
   requires that the tests pass with the latest R.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, end users will be able to use SparkR with R 4.0.0.
   
   ### How was this patch tested?
   
   Manually tested.
   
   Note that I built SparkR with R 4.0.0 and ran the tests with R 3.6.3. They
   all passed. See also [the comment in the JIRA](https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142837&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17142837).

