Adam Black created ARROW-18102:
----------------------------------

             Summary: dplyr::count and dplyr::tally implementation return NA 
instead of 0
                 Key: ARROW-18102
                 URL: https://issues.apache.org/jira/browse/ARROW-18102
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
         Environment: Arrow R package 9.0.0 on Mac OS 12.6 with R 4.2.0
            Reporter: Adam Black


I'm using dplyr with FileSystemDataset objects, and I expect the behavior to 
match (or closely follow) that of data frames. When a FileSystemDataset has 
zero rows, dplyr::count and dplyr::tally return NA instead of 0. I would expect 
the result to be 0, as it is for a zero-row data frame.

 

``` r
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

path <- tempfile(fileext = ".feather")

zero_row_dataset <- cars %>% filter(dist < 0)

# expected behavior
zero_row_dataset %>% 
  count()
#>   n
#> 1 0

zero_row_dataset %>% 
  tally()
#>   n
#> 1 0

nrow(zero_row_dataset)
#> [1] 0

# now test behavior with a FileSystemDataset
write_feather(zero_row_dataset, path)
ds <- open_dataset(path, format = "feather")
ds
#> FileSystemDataset with 1 Feather file
#> speed: double
#> dist: double
#> 
#> See $metadata for additional Schema metadata

# actual behavior
ds %>% 
  count() %>% 
  collect() # incorrect result
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    NA

ds %>% 
  tally() %>% 
  collect() # incorrect result
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    NA

nrow(ds) # works as expected
#> [1] 0
```

<sup>Created on 2022-10-19 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
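
Until this is fixed, one possible stopgap is to patch the count after `collect()`. This is a minimal sketch, not a fix for the underlying bug: the helper name `na_count_to_zero` is hypothetical, and it assumes the NA only appears in the `n` column of an ungrouped count on a zero-row dataset. The `collected` tibble stands in for the result shown in the reprex, so the snippet runs without arrow.

``` r
library(dplyr)

# Hypothetical helper (not part of arrow or dplyr): replace an NA count
# with 0 once the result has been collected into a tibble.
na_count_to_zero <- function(counts) {
  counts %>% mutate(n = coalesce(n, 0L))
}

# Stand-in for `ds %>% count() %>% collect()` on the zero-row dataset,
# which returns a 1 x 1 tibble whose integer column n is NA.
collected <- tibble(n = NA_integer_)

fixed <- na_count_to_zero(collected)
fixed$n
```

With the dataset from the reprex, the intended usage would be `ds %>% count() %>% collect() %>% na_count_to_zero()`. Because the replacement happens after `collect()`, it does not depend on how arrow evaluates the query.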



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
