[
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717079#comment-14717079
]
Indrajit commented on SPARK-6817:
----------------------------------
Here are some suggestions on the proposed API. If the idea is to keep the API
close to R's existing primitives, we should avoid introducing too many new
functions. For example, dapplyCollect can be expressed as
collect(dapply(...)). Since collect already exists in Spark, and R users are
familiar with it from dplyr, we should reuse it instead of introducing a new
function dapplyCollect.
Relying on existing syntax will reduce the learning curve for users. Was
performance the primary reason for introducing dapplyCollect instead of
collect(dapply(...))?
Similarly, can we do away with gapply and gapplyCollect and express them using
dapply? In R, the function "split" provides grouping
(https://stat.ethz.ch/R-manual/R-devel/library/base/html/split.html). One
should be able to implement "split" using GroupBy in Spark.
"gapply" can then be expressed in terms of dapply and split, and gapplyCollect
becomes collect(dapply(..split..)).
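Here is a simple example that uses split and lapply in R:
# Sample data: one row per person
df <- data.frame(city = c("A", "B", "A", "D"), age = c(10, 12, 23, 5))
print(df)
# Group the ages by city, then compute the mean age of each group
s <- split(df$age, df$city)
lapply(s, mean)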
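To make the gapply analogy more concrete, the same idea can be applied to
whole data frames rather than a single column. The following is only a
base R sketch run on a local data.frame. It does not use SparkR and says
nothing about how Spark would implement split. The column name mean_age
and the per-group function are assumptions made for illustration.

```r
df <- data.frame(city = c("A", "B", "A", "D"), age = c(10, 12, 23, 5))

# split() on a whole data frame yields one sub-data-frame per city
groups <- split(df, df$city)

# Apply a function to each group; each call returns a one-row data frame.
# mean_age is a hypothetical output column chosen for this example.
per_group <- lapply(groups, function(g) {
  data.frame(city = g$city[1], mean_age = mean(g$age))
})

# Bind the per-group results back into a single data frame
result <- do.call(rbind, per_group)
print(result)
```

The split step plays the role of the grouping, the lapply step plays the
role of the per-group function, and the final rbind plays the role of
collect.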
> DataFrame UDFs in R
> -------------------
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
> Issue Type: New Feature
> Components: SparkR, SQL
> Reporter: Shivaram Venkataraman
>
> This depends on some internal interface of Spark SQL, should be done after
> merging into Spark.