HyukjinKwon commented on a change in pull request #23746: [SPARK-26761][SQL][R]
Vectorized R gapply() implementation
URL: https://github.com/apache/spark/pull/23746#discussion_r255849370
##########
File path: R/pkg/R/group.R
##########
@@ -229,6 +229,24 @@ gapplyInternal <- function(x, func, schema) {
if (is.character(schema)) {
schema <- structType(schema)
}
+ arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
+ if (arrowEnabled) {
+ if (is.null(schema)) {
+ stop(paste0("Arrow optimization does not support gapplyCollect yet. Please use ",
+ "'collect' and 'gapply' APIs instead."))
Review comment:
@felixcheung, I was double-checking things one by one and realised that
`gapplyCollect()` needs some more fixes. For now, I have disabled it when
Arrow is enabled:
```r
> df <- createDataFrame(mtcars)
> gapplyCollect(df,
+ "gear",
+ function(key, group) {
+ data.frame(gear = key[[1]], disp = mean(group$disp) > group$disp)
+ })
Error in gapplyInternal(x, func, NULL) :
Arrow optimization does not support gapplyCollect yet. Please use
'collect' and 'gapply' APIs instead.
```
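In the meantime, the error message points to combining `collect` and `gapply`. A minimal sketch of that workaround for the same example follows. The output schema is my assumption (a `double` key and a `boolean` flag), and it presumes a session with Arrow enabled and the `arrow` R package installed. I have not confirmed that every column type here is supported on the Arrow path.

```r
library(SparkR)

# Assumption: Arrow optimization is switched on for this session.
sparkR.session(sparkConfig = list(spark.sql.execution.arrow.enabled = "true"))

df <- createDataFrame(mtcars)

# Assumed output schema: gapply() requires one, unlike gapplyCollect().
# gear is numeric in mtcars; disp becomes a logical comparison result.
schema <- structType("gear double, disp boolean")

# gapply() returns a SparkDataFrame; collect() brings it back as a data.frame.
result <- collect(gapply(df,
                         "gear",
                         function(key, group) {
                           data.frame(gear = key[[1]],
                                      disp = mean(group$disp) > group$disp)
                         },
                         schema))
head(result)
```

The only difference from `gapplyCollect()` is that the schema has to be spelled out up front, so the `NULL` schema branch that raises the error is never reached.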
I need a few lines of changes (I guess between 10 and 20 lines) to support
`gapplyCollect()`, but let me do this separately with a separate set of tests.
I filed a JIRA here: https://issues.apache.org/jira/browse/SPARK-26858. I
will work on it as soon as this PR gets merged.