[ 
https://issues.apache.org/jira/browse/ARROW-17490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane updated ARROW-17490:
---------------------------------
    Description: 
We get different results for dplyr versus Acero if we call log on a column that 
contains 0, i.e.

{code:r}
library(arrow)
library(dplyr)

df <- tibble(x = 0:10)

# In dplyr/base R
df %>%
  mutate(y = log(x)) %>%
  collect()
#> # A tibble: 11 × 2
#>        x        y
#>    <int>    <dbl>
#>  1     0 -Inf    
#>  2     1    0    
#>  3     2    0.693
#>  4     3    1.10 
#>  5     4    1.39 
#>  6     5    1.61 
#>  7     6    1.79 
#>  8     7    1.95 
#>  9     8    2.08 
#> 10     9    2.20 
#> 11    10    2.30

# In Acero
df %>%
  arrow_table() %>%
  mutate(y = log(x)) %>%
  collect()
#> Error in `collect()`:
#> ! Invalid: logarithm of zero
{code}

This is because R defines {{log(0)}} as {{-Inf}}, whereas Acero raises an 
error.  It is not clear what the right fix is; should we request an Acero 
option that controls this behaviour?
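
A minimal sketch of the two behaviours at the compute-function level, for 
illustration only.  It assumes that the Arrow compute registry exposes both an 
unchecked {{ln}} kernel and a checked {{ln_checked}} kernel, and that the error 
above comes from the checked one.  Neither assumption is stated in this issue, 
so both should be verified against the installed arrow version.

{code:r}
library(arrow)

x <- Array$create(0:10)

# Assumption: the unchecked "ln" kernel follows IEEE semantics, so the
# first element (log of 0) should match base R instead of raising an error.
unchecked <- call_function("ln", x)
as.vector(unchecked)[1]

# Assumption: the checked "ln_checked" kernel is what raises
# "Invalid: logarithm of zero"; capture the message rather than aborting.
checked <- tryCatch(
  call_function("ln_checked", x),
  error = function(e) conditionMessage(e)
)
checked
{code}

If those assumptions hold, one possible direction is for the R binding to use 
the unchecked kernel so that the result matches base R, rather than adding a 
new Acero option.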

  was:
We get different results for dplyr versus Acero if we call log on a column that 
contains 0, i.e.

{code:r}
library(arrow)
library(dplyr)

df <- tibble(x = 0:10)

df %>%
  mutate(y = log(x)) %>%
  collect()
#> # A tibble: 11 × 2
#>        x        y
#>    <int>    <dbl>
#>  1     0 -Inf    
#>  2     1    0    
#>  3     2    0.693
#>  4     3    1.10 
#>  5     4    1.39 
#>  6     5    1.61 
#>  7     6    1.79 
#>  8     7    1.95 
#>  9     8    2.08 
#> 10     9    2.20 
#> 11    10    2.30

df %>%
  arrow_table() %>%
  mutate(y = log(x)) %>%
  collect()
#> Error in `collect()`:
#> ! Invalid: logarithm of zero
{code}

This is because R defines {{log(0)}} as {{-Inf}} whereas Acero defines it as an 
error.  Not sure what the solution is here; do we want to request the addition 
of an Acero option to define behaviour for this?


> [R] Differing results in log bindings
> -------------------------------------
>
>                 Key: ARROW-17490
>                 URL: https://issues.apache.org/jira/browse/ARROW-17490
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Priority: Major
>
> We get different results for dplyr versus Acero if we call log on a column 
> that contains 0, i.e.
> {code:r}
> library(arrow)
> library(dplyr)
> df <- tibble(x = 0:10)
> # In dplyr/base R
> df %>%
>   mutate(y = log(x)) %>%
>   collect()
> #> # A tibble: 11 × 2
> #>        x        y
> #>    <int>    <dbl>
> #>  1     0 -Inf    
> #>  2     1    0    
> #>  3     2    0.693
> #>  4     3    1.10 
> #>  5     4    1.39 
> #>  6     5    1.61 
> #>  7     6    1.79 
> #>  8     7    1.95 
> #>  9     8    2.08 
> #> 10     9    2.20 
> #> 11    10    2.30
> # In Acero
> df %>%
>   arrow_table() %>%
>   mutate(y = log(x)) %>%
>   collect()
> #> Error in `collect()`:
> #> ! Invalid: logarithm of zero
> {code}
> This is because R defines {{log(0)}} as {{-Inf}}, whereas Acero raises an 
> error.  It is not clear what the right fix is; should we request an Acero 
> option that controls this behaviour?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
