[
https://issues.apache.org/jira/browse/SPARK-41937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vivek Atal updated SPARK-41937:
-------------------------------
Description:
Base R 4.2.0 introduced a change ([[Rd] R 4.2.0 is
released|https://stat.ethz.ch/pipermail/r-announce/2022/000683.html]):
"{{Calling if() or while() with a condition of length greater than one gives
an error rather than a warning.}}"
The code below is a reproducible example of the issue: executed on R >= 4.2.0
it raises an error, while on earlier versions it only emits a warning.
{{Sys.time()}} returns a multi-class object in R, and throughout the SparkR
repository the {{if}} statement is written as {{if (class(x) == "Column")}},
which fails on R >= 4.2.0. Note that R allows an object to carry multiple
{{class}} names as a character vector ([R: Object
Classes|https://stat.ethz.ch/R-manual/R-devel/library/base/html/class.html]);
comparing {{class(x)}} with {{==}} was therefore never a reliable check in the
first place.
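The fragility can be shown in base R alone, without SparkR (a minimal
illustration; the class vector shown is what {{Sys.time()}} actually returns):
{code:r}
t <- Sys.time()
class(t)                 # c("POSIXct", "POSIXt"): a length-2 character vector
class(t) == "POSIXct"    # TRUE FALSE: '==' is vectorized over the class names
inherits(t, "POSIXct")   # TRUE: always length 1, so it is safe inside if()
{code}
On R >= 4.2.0, {{if (class(t) == "POSIXct")}} stops with "the condition has
length > 1", whereas {{if (inherits(t, "POSIXct"))}} works on every R version.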
The chunks below were executed on R version 4.1.3.
{code:r}
{
SparkR::sparkR.session()
t <- Sys.time()
sdf <- SparkR::createDataFrame(data.frame(x = t + c(-1, 1, -1, 1, -1)))
SparkR::collect(SparkR::filter(sdf, SparkR::column('x') > t))
}
#> Warning in if (class(e2) == 'Column') {: the condition has length > 1
#> and only the first element will be used
#> x
#> 1 2023-01-07 20:40:20
#> 2 2023-01-07 20:40:20
{code}
{code:r}
{
Sys.setenv(`_R_CHECK_LENGTH_1_CONDITION_` = "true")
SparkR::sparkR.session()
t <- Sys.time()
sdf <- SparkR::createDataFrame(data.frame(x = t + c(-1, 1, -1, 1, -1)))
SparkR::collect(SparkR::filter(sdf, SparkR::column('x') > t))
}
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x'
#> in selecting a method for function 'collect': error in evaluating the
#> argument 'condition' in selecting a method for function 'filter': the
#> condition has length > 1
{code}
A similar issue occurs in these SparkR functions wherever multi-class data
such as {{Sys.time()}} may be passed: {{lit}}, {{fillna}}, {{when}},
{{otherwise}}, {{contains}}, {{ifelse}}.
The suggested change is to wrap the check in {{all}} (or {{any}}, as
appropriate) when testing whether {{class(.)}} equals {{"Column"}}:
{{if (all(class(.) == "Column"))}}. Better still, use {{base::inherits}} for
the check: {{if (inherits(., "Column"))}}.
> SparkR datetime column compare with Sys.time() throws error in R (>= 4.2.0)
> ---------------------------------------------------------------------------
>
> Key: SPARK-41937
> URL: https://issues.apache.org/jira/browse/SPARK-41937
> Project: Spark
> Issue Type: Bug
> Components: R, SparkR
> Affects Versions: 3.3.0
> Reporter: Vivek Atal
> Priority: Minor
> Labels: newbie
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]