[GitHub] HyukjinKwon commented on a change in pull request #23760: [SPARK-26762][SQL][R] Arrow optimization for conversion from Spark DataFrame to R DataFrame

GitBox Thu, 14 Feb 2019 17:52:52 -0800

HyukjinKwon commented on a change in pull request #23760: [SPARK-26762][SQL][R] 
Arrow optimization for conversion from Spark DataFrame to R DataFrame
URL: https://github.com/apache/spark/pull/23760#discussion_r257079861


 ##########
 File path: R/pkg/R/DataFrame.R
 ##########
 @@ -1177,11 +1177,67 @@ setMethod("dim",
 setMethod("collect",
           signature(x = "SparkDataFrame"),
           function(x, stringsAsFactors = FALSE) {
+            connectionTimeout <- as.numeric(Sys.getenv("SPARKR_BACKEND_CONNECTION_TIMEOUT", "6000"))
+            useArrow <- FALSE
+            arrowEnabled <- sparkR.conf("spark.sql.execution.arrow.enabled")[[1]] == "true"
+            if (arrowEnabled) {
+              useArrow <- tryCatch({
+                requireNamespace1 <- requireNamespace
+                if (!requireNamespace1("arrow", quietly = TRUE)) {
+                  stop("'arrow' package should be installed.")
+                }
+                # Currenty Arrow optimization does not support raw for now.
+                # Also, it does not support explicit float type set by users.
+                if (inherits(schema(x), "structType")) {
+                  if (any(sapply(schema(x)$fields(),
 
 Review comment:
   Yes, this was also pointed out by @felixcheung 
(https://github.com/apache/spark/pull/23760#discussion_r256226667).
   
   The thing is, this logic can be deduplicated across the Arrow optimization
   code paths (collect, createDataFrame, dapply, and gapply), if I am not
   mistaken. I plan to deduplicate all of it after finishing the initial
   implementation of the Arrow optimization, so that the PRs do not conflict
   with each other. dapply (https://github.com/apache/spark/pull/23787) and
   this PR are the only ones left.
   
   Here's my plan for after the initial implementations:
   https://github.com/apache/spark/pull/23787#issuecomment-463562490.
   
   Actually, struct type and map type should also be restricted. I will do
   that, together with a set of deduplicated unit tests, after all the initial
   implementations are done. Does this plan make sense to you?
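
   As a rough illustration of the kind of deduplication meant here, the
   type check could live in one shared helper that each code path calls.
   This is only a sketch under stated assumptions: the names
   `unsupportedArrowTypes`, `checkArrowSupportedTypes`, and `canUseArrow` are
   hypothetical and not part of SparkR, the schema is represented as a plain
   character vector of type names so that the example runs in base R, and the
   fallback to FALSE on error is an assumption about how the `tryCatch` in the
   diff behaves.

```r
# Hypothetical helper, not part of SparkR. The list of unsupported types is
# taken from the discussion above: raw (binary), float, struct, and map.
unsupportedArrowTypes <- c("FloatType", "BinaryType", "StructType", "MapType")

checkArrowSupportedTypes <- function(typeNames) {
  # Strip type parameters, e.g. "MapType(StringType,IntegerType,true)"
  # becomes "MapType".
  baseNames <- sub("\\(.*$", "", typeNames)
  bad <- unique(baseNames[baseNames %in% unsupportedArrowTypes])
  if (length(bad) > 0) {
    stop("Arrow optimization does not support: ", paste(bad, collapse = ", "))
  }
  invisible(TRUE)
}

# Assumed fallback behavior: return FALSE instead of failing, so the caller
# can use the non-Arrow path.
canUseArrow <- function(typeNames) {
  tryCatch({
    checkArrowSupportedTypes(typeNames)
    TRUE
  }, error = function(e) FALSE)
}
```

   With a helper like this, collect, createDataFrame, dapply, and gapply would
   each make a single call such as `canUseArrow(typeNames)` rather than
   repeating the per-type checks.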
   
   For now, I am focusing on each implementation separately to avoid
   conflicts. I promise that I will fix this after this PR and
   https://github.com/apache/spark/pull/23787.

