zero323 commented on a change in pull request #29813:
URL: https://github.com/apache/spark/pull/29813#discussion_r491707650
##########
File path: R/pkg/R/DataFrame.R
##########
@@ -2863,11 +2863,19 @@ setMethod("unionAll",
#' \code{UNION ALL} and \code{UNION DISTINCT} in SQL as column positions are not taken
#' into account. Input SparkDataFrames can have different data types in the schema.
#'
+#' When the parameter `allowMissingColumns` is `TRUE`, this function allows
+#' different set of column names between two `SparkDataFrames`.
+#' Missing columns at each side, will be filled with null values.
+#' The missing columns at left `SparkDataFrame` will be added at the end in the schema
+#' of the union result.
+#'
#' Note: This does not remove duplicate rows across the two SparkDataFrames.
#' This function resolves columns by name (not by position).
#'
#' @param x A SparkDataFrame
#' @param y A SparkDataFrame
+#' @param allowMissingColumns logical
+#' @param ... further arguments to be passed to or from other methods.
Review comment:
That's correct, but I am not sure if there is a better way of handling
that.
Right now we have the generic defined as follows:
```R
setGeneric("unionByName", function(x, y, ...) {
  standardGeneric("unionByName")
})
```
‒ as far as I am aware this is the convention for handling optional
arguments we use in SparkR.
Technically speaking we could have
```R
setGeneric("unionByName", function(x, y, allowMissingColumns) {
  standardGeneric("unionByName")
})
```
but then we'd have to support
```R
signature(x = "SparkDataFrame", y = "SparkDataFrame", allowMissingColumns = "missing")
```
and
```R
signature(x = "SparkDataFrame", y = "SparkDataFrame", allowMissingColumns = "logical")
```
if I am not mistaken, and in the past I've been told that's too much.
Am I missing something?
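To make the trade-off concrete, here is a minimal sketch of both variants that runs without Spark. `ToyFrame`, `combineCols`, `unionByNameToy`, and `unionByNameToy2` are hypothetical stand-ins invented for illustration; they are not SparkR code and only mimic the dispatch shape discussed above.

```R
library(methods)

# Hypothetical stand-in for SparkDataFrame: it only tracks column names.
setClass("ToyFrame", representation(cols = "character"))

# Hypothetical helper shared by both variants.
combineCols <- function(x, y, allowMissingColumns) {
  if (!allowMissingColumns && !setequal(x@cols, y@cols)) {
    stop("column names differ")
  }
  new("ToyFrame", cols = union(x@cols, y@cols))
}

# Variant 1: the generic exposes only x and y; the optional argument
# travels through `...` and gets its default in the method itself.
setGeneric("unionByNameToy", function(x, y, ...) {
  standardGeneric("unionByNameToy")
})
setMethod("unionByNameToy",
          signature(x = "ToyFrame", y = "ToyFrame"),
          function(x, y, allowMissingColumns = FALSE) {
            combineCols(x, y, allowMissingColumns)
          })

# Variant 2: the optional argument is part of the generic, so dispatch
# needs one method for "missing" and one for "logical".
setGeneric("unionByNameToy2", function(x, y, allowMissingColumns) {
  standardGeneric("unionByNameToy2")
})
setMethod("unionByNameToy2",
          signature(x = "ToyFrame", y = "ToyFrame",
                    allowMissingColumns = "missing"),
          function(x, y, allowMissingColumns) {
            unionByNameToy2(x, y, FALSE)
          })
setMethod("unionByNameToy2",
          signature(x = "ToyFrame", y = "ToyFrame",
                    allowMissingColumns = "logical"),
          function(x, y, allowMissingColumns) {
            combineCols(x, y, allowMissingColumns)
          })

a <- new("ToyFrame", cols = c("carb", "am", "gear"))
b <- new("ToyFrame", cols = "carb")
r1 <- unionByNameToy(a, b, allowMissingColumns = TRUE)
r2 <- unionByNameToy2(a, b, TRUE)
```

Both variants behave the same from the caller's side; the second just needs two `setMethod` calls per optional argument, which is the overhead mentioned above.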
##########
File path: R/pkg/R/DataFrame.R
##########
@@ -2880,12 +2888,15 @@ setMethod("unionAll",
#' df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
#' df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
#' head(unionByName(df1, df2))
+#'
+#' df3 <- select(createDataFrame(mtcars), "carb")
+#' head(unionByName(df1, df3))
Review comment:
Thanks, this is supposed to be the one. Fixed.
##########
File path: R/pkg/R/DataFrame.R
##########
@@ -2863,11 +2863,19 @@ setMethod("unionAll",
#' \code{UNION ALL} and \code{UNION DISTINCT} in SQL as column positions are not taken
#' into account. Input SparkDataFrames can have different data types in the schema.
#'
+#' When the parameter `allowMissingColumns` is `TRUE`, this function allows
Review comment:
Sounds good, but for consistency it should be fixed in Python and Scala as
well.
Should it be
> columns will be filled as null.
or
> columns will be filled with null.
though?