[ https://issues.apache.org/jira/browse/ARROW-18102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicola Crane updated ARROW-18102:
---------------------------------
Description:
I'm using dplyr with FileSystemDataset objects. The expected behavior is
similar to (or the same as) data frame behavior. When the FileSystemDataset has
zero rows, dplyr::count and dplyr::tally return NA instead of 0. I would expect
the result to be 0.
{code:r}
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

path <- tempfile(fileext = ".feather")
zero_row_dataset <- cars %>% filter(dist < 0)

# expected behavior
zero_row_dataset %>%
  count()
#>   n
#> 1 0
zero_row_dataset %>%
  tally()
#>   n
#> 1 0
nrow(zero_row_dataset)
#> [1] 0

# now test behavior with a FileSystemDataset
write_feather(zero_row_dataset, path)
ds <- open_dataset(path, format = "feather")
ds
#> FileSystemDataset with 1 Feather file
#> speed: double
#> dist: double
#>
#> See $metadata for additional Schema metadata

# actual behavior
ds %>%
  count() %>%
  collect() # incorrect result
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    NA
ds %>%
  tally() %>%
  collect() # incorrect result
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    NA
nrow(ds) # works as expected
#> [1] 0
{code}
was:
I'm using dplyr with FileSystemDataset objects. The expected behavior is
similar (or the same as) dataframe behavior. When the FileSystemDataset has
zero rows dplyr::count and dplyr::tally return NA instead of 0. I would expect
the result to be 0.
``` r
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

path <- tempfile(fileext = ".feather")
zero_row_dataset <- cars %>% filter(dist < 0)

# expected behavior
zero_row_dataset %>%
  count()
#>   n
#> 1 0
zero_row_dataset %>%
  tally()
#>   n
#> 1 0
nrow(zero_row_dataset)
#> [1] 0

# now test behavior with a FileSystemDataset
write_feather(zero_row_dataset, path)
ds <- open_dataset(path, format = "feather")
ds
#> FileSystemDataset with 1 Feather file
#> speed: double
#> dist: double
#>
#> See $metadata for additional Schema metadata

# actual behavior
ds %>%
  count() %>%
  collect() # incorrect result
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    NA
ds %>%
  tally() %>%
  collect() # incorrect result
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    NA
nrow(ds) # works as expected
#> [1] 0
```
<sup>Created on 2022-10-19 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
> [R] dplyr::count and dplyr::tally implementation return NA instead of 0
> -----------------------------------------------------------------------
>
> Key: ARROW-18102
> URL: https://issues.apache.org/jira/browse/ARROW-18102
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Environment: Arrow R package 9.0.0 on Mac OS 12.6 with R 4.2.0
> Reporter: Adam Black
> Priority: Minor