Vivek Atal created SPARK-41937:
----------------------------------
Summary: SparkR datetime column compare with Sys.time() throws
error in R (>= 4.2.0)
Key: SPARK-41937
URL: https://issues.apache.org/jira/browse/SPARK-41937
Project: Spark
Issue Type: Bug
Components: R, SparkR
Affects Versions: 3.3.0
Reporter: Vivek Atal
Base R 4.2.0 introduced a change ([[Rd] R 4.2.0 is
released|https://stat.ethz.ch/pipermail/r-announce/2022/000683.html]):
"{{Calling if() or while() with a condition of length greater than one gives
an error rather than a warning.}}"
The code below is a reproducible example of the issue. In R >= 4.2.0 it raises
an error; in earlier versions it only produces a warning.
{{Sys.time()}} returns an object with more than one class ({{"POSIXct"}} and
{{"POSIXt"}}), and throughout the SparkR code base the class check is written
as {{if (class(x) == "Column")}}. For such an object the comparison yields a
logical vector of length two, which causes the error in R >= 4.2.0. Note that
R allows an object to have multiple {{class}} names as a character vector
([R: Object
Classes|https://stat.ethz.ch/R-manual/R-devel/library/base/html/class.html]),
so this kind of check was fragile from the start.
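The length-two condition can be seen in base R alone, without a Spark session. The snippet below is a minimal sketch, not SparkR code: the variable names are chosen for illustration, and {{tryCatch}} is used only to report whether the running R version signals a warning or an error.

```r
# Sys.time() returns an object with two class names.
t <- Sys.time()
class(t)

# Comparing the class vector with a single string gives one result per class name.
cond <- class(t) == "Column"
length(cond)

# Using that vector directly in if() signals a warning in R < 4.2.0
# (unless _R_CHECK_LENGTH_1_CONDITION_ is set) and an error in R >= 4.2.0.
outcome <- tryCatch(
  if (cond) "column" else "not a column",
  warning = function(w) "warning",
  error = function(e) "error"
)
outcome
```

This mirrors the shape of the failing check; it does not reproduce the SparkR call path itself.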
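Both chunks below were executed on R version 4.1.3; the second sets
{{_R_CHECK_LENGTH_1_CONDITION_}} to turn the warning into an error, as
R >= 4.2.0 does by default.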
{code:java}
{
SparkR::sparkR.session()
t <- Sys.time()
sdf <- SparkR::createDataFrame(data.frame(x = t + c(-1, 1, -1, 1, -1)))
SparkR::collect(SparkR::filter(sdf, SparkR::column('x') > t))
}
#> Warning in if (class(e2) == 'Column') {: the condition has length > 1
#> and only the first element will be used
#> x
#> 1 2023-01-07 20:40:20
#> 2 2023-01-07 20:40:20
{code}
{code:java}
{
Sys.setenv(`_R_CHECK_LENGTH_1_CONDITION_` = "true")
SparkR::sparkR.session()
t <- Sys.time()
sdf <- SparkR::createDataFrame(data.frame(x = t + c(-1, 1, -1, 1, -1)))
SparkR::collect(SparkR::filter(sdf, SparkR::column('x') > t))
}
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x'
#> in selecting a method for function 'collect': error in evaluating the
#> argument 'condition' in selecting a method for function 'filter': the
#> condition has length > 1
{code}
A similar issue occurs in these SparkR functions, where multi-class values such
as the result of {{Sys.time()}} might be passed: {{lit}}, {{fillna}}, {{when}},
{{otherwise}}, {{contains}}, {{ifelse}}.
The suggested change is to wrap the comparison in {{all}} (or {{any}}, as
appropriate) when checking whether {{class(.)}} is {{Column}}:
{{if (all(class(.) == "Column"))}}.
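A minimal sketch of the suggested check in base R follows. The helper {{is_column}} and the stand-in objects are hypothetical and exist only for illustration; a real SparkR {{Column}} is an S4 object, and this is not the patch applied to SparkR.

```r
# Hypothetical helper (not part of SparkR): TRUE only if every class name is "Column".
is_column <- function(x) all(class(x) == "Column")

# Stand-in object for illustration only; it merely carries the class name "Column".
fake_column <- structure(list(), class = "Column")

is_column(fake_column)  # single class name "Column"
is_column(Sys.time())   # class names "POSIXct" and "POSIXt"

# The condition passed to if() is now always of length one,
# so it no longer triggers the R >= 4.2.0 error.
describe <- function(x) if (is_column(x)) "Column" else "literal value"
describe(Sys.time())

# all() and any() differ when "Column" is only one of several class names.
mixed <- structure(list(), class = c("Column", "other"))
all(class(mixed) == "Column")
any(class(mixed) == "Column")
```

Which of {{all}} or {{any}} is appropriate depends on whether an object that carries additional class names should still count as a {{Column}}. Base R's {{inherits(x, "Column")}} is another common way to write a length-one class check; whether it suits SparkR is not addressed in this report.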