[GitHub] [spark] MichaelChirico commented on a change in pull request #29813: [SPARK-32799][R][SQL] Add allowMissingColumns to SparkR unionByName

GitBox Mon, 21 Sep 2020 20:47:44 -0700


MichaelChirico commented on a change in pull request #29813:
URL: https://github.com/apache/spark/pull/29813#discussion_r491705832




##########
File path: R/pkg/R/DataFrame.R
##########
@@ -2863,11 +2863,19 @@ setMethod("unionAll",
 #' \code{UNION ALL} and \code{UNION DISTINCT} in SQL as column positions are not taken
 #' into account. Input SparkDataFrames can have different data types in the schema.
 #'
+#' When the parameter `allowMissingColumns` is `TRUE`, this function allows

Review comment:
       Grammar revision:
   
   > When the parameter `allowMissingColumns` is `TRUE`, the set of column names in `x` and `y` can differ; missing columns will be filled as null. Further, the missing columns of `x` will be added at the end in the schema of the union result.

##########
File path: R/pkg/R/DataFrame.R
##########
@@ -2863,11 +2863,19 @@ setMethod("unionAll",
 #' \code{UNION ALL} and \code{UNION DISTINCT} in SQL as column positions are not taken
 #' into account. Input SparkDataFrames can have different data types in the schema.
 #'
+#' When the parameter `allowMissingColumns` is `TRUE`, this function allows
+#' different set of column names between two `SparkDataFrames`.
+#' Missing columns at each side, will be filled with null values.
+#' The missing columns at left `SparkDataFrame` will be added at the end in the schema
+#' of the union result.
+#'
 #' Note: This does not remove duplicate rows across the two SparkDataFrames.
 #' This function resolves columns by name (not by position).
 #'
 #' @param x A SparkDataFrame
 #' @param y A SparkDataFrame
+#' @param allowMissingColumns logical
+#' @param ... further arguments to be passed to or from other methods.

Review comment:
       `...` is not actually supported?

##########
File path: R/pkg/R/DataFrame.R
##########
@@ -2880,12 +2888,15 @@ setMethod("unionAll",
 #' df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
 #' df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
 #' head(unionByName(df1, df2))
+#'
+#' df3 <- select(createDataFrame(mtcars), "carb")
+#' head(unionByName(df1, df3))

Review comment:
       An example where `allowMissingColumns = TRUE` would be helpful.
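
For instance, something along these lines might work as the extra example. This is a rough sketch only: it assumes an active SparkR session and that `unionByName` accepts `allowMissingColumns` as proposed in this PR. `df3` is placed on the left so that the null-filled rows are likely to appear in `head`.

```r
library(SparkR)
sparkR.session()  # assumes a local Spark installation is available

df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
df3 <- select(createDataFrame(mtcars), "carb")

# df3 has no "am" or "gear" column; with allowMissingColumns = TRUE
# those columns are filled with null for the rows that come from df3.
head(unionByName(df3, df1, allowMissingColumns = TRUE))
```

Putting the narrower frame first also shows the columns it lacks being appended at the end of the result schema.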

##########
File path: R/pkg/R/DataFrame.R
##########
@@ -2863,11 +2863,19 @@ setMethod("unionAll",
 #' \code{UNION ALL} and \code{UNION DISTINCT} in SQL as column positions are not taken
 #' into account. Input SparkDataFrames can have different data types in the schema.
 #'
+#' When the parameter `allowMissingColumns` is `TRUE`, this function allows
+#' different set of column names between two `SparkDataFrames`.
+#' Missing columns at each side, will be filled with null values.
+#' The missing columns at left `SparkDataFrame` will be added at the end in the schema
+#' of the union result.
+#'
 #' Note: This does not remove duplicate rows across the two SparkDataFrames.
 #' This function resolves columns by name (not by position).
 #'
 #' @param x A SparkDataFrame
 #' @param y A SparkDataFrame
+#' @param allowMissingColumns logical
+#' @param ... further arguments to be passed to or from other methods.

Review comment:
       nvm, I see below that it's added to the generic

##########
File path: R/pkg/R/DataFrame.R
##########
@@ -2863,11 +2863,19 @@ setMethod("unionAll",
 #' \code{UNION ALL} and \code{UNION DISTINCT} in SQL as column positions are not taken
 #' into account. Input SparkDataFrames can have different data types in the schema.
 #'
+#' When the parameter `allowMissingColumns` is `TRUE`, this function allows
+#' different set of column names between two `SparkDataFrames`.
+#' Missing columns at each side, will be filled with null values.
+#' The missing columns at left `SparkDataFrame` will be added at the end in the schema
+#' of the union result.
+#'
 #' Note: This does not remove duplicate rows across the two SparkDataFrames.
 #' This function resolves columns by name (not by position).
 #'
 #' @param x A SparkDataFrame
 #' @param y A SparkDataFrame
+#' @param allowMissingColumns logical
+#' @param ... further arguments to be passed to or from other methods.

Review comment:
       The way you've done it looks natural to me

##########
File path: R/pkg/R/DataFrame.R
##########
@@ -2863,11 +2863,19 @@ setMethod("unionAll",
 #' \code{UNION ALL} and \code{UNION DISTINCT} in SQL as column positions are not taken
 #' into account. Input SparkDataFrames can have different data types in the schema.
 #'
+#' When the parameter `allowMissingColumns` is `TRUE`, this function allows

Review comment:
       good catch, with null is better




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


