thisisnic commented on PR #46431:
URL: https://github.com/apache/arrow/pull/46431#issuecomment-2906640189
@amoeba I tried the approach you suggested here, but because we call
`as_arrow_table()` internally in many more functions, it ends up breaking
roundtripping with Feather etc.
I think if we work purely in R, we would want to remove the labels and then
restore them later, but I'm still trying to find an uncomplicated way of doing this.
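Roughly what I have in mind, as a purely illustrative sketch (these helpers don't exist in arrow, the names are made up, and it assumes the underlying storage type survives the roundtrip so the attributes can just be reattached):

``` r
library(haven)

# Record the haven_labelled attributes per column, strip them so the bare
# vectors convert cleanly, then reattach the attributes on the way back out.
strip_labelled <- function(df) {
  saved <- lapply(df, function(col) {
    if (inherits(col, "haven_labelled")) attributes(col) else NULL
  })
  df[] <- lapply(df, function(col) {
    if (inherits(col, "haven_labelled")) {
      attributes(col) <- NULL  # leaves the bare integer/double/character vector
    }
    col
  })
  list(data = df, attrs = saved)
}

restore_labelled <- function(df, attrs) {
  for (nm in intersect(names(attrs), names(df))) {
    if (!is.null(attrs[[nm]])) attributes(df[[nm]]) <- attrs[[nm]]
  }
  df
}

stripped <- strip_labelled(tibble::tibble(a = labelled(1:5), b = labelled(11:15)))
restored <- restore_labelled(stripped$data, stripped$attrs)
```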
I think we definitely want to stop the segfault regardless and error instead.
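On the erroring side, something along these lines is the shape I'd expect (again just a sketch; `check_no_labelled()` is a made-up name, and the real check would need to live wherever the crash currently originates rather than necessarily over an R data frame):

``` r
# Hypothetical guard: refuse haven_labelled columns with a clear error
# instead of letting them reach a code path that can segfault.
check_no_labelled <- function(df) {
  labelled_cols <- names(df)[vapply(df, inherits, logical(1), "haven_labelled")]
  if (length(labelled_cols) > 0) {
    stop(
      "Can't convert haven_labelled column(s): ",
      paste(labelled_cols, collapse = ", "),
      ". Cast them with `mutate()` (e.g. `as.integer()`) first.",
      call. = FALSE
    )
  }
  invisible(df)
}
```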
Users can technically use `mutate()` to cast the column to a type we can
work with, *but* there will be a resource cost to doing this on a dataset. See
my reprex below.
``` r
library(haven)
library(arrow)
library(tibble)
library(dplyr)
d <- tibble(
  a = labelled(x = 1:5),
  b = labelled(x = 11:15)
)
tf <- tempfile()
write_parquet(d, tf)
# still fails
read_parquet(tf, as_data_frame = FALSE) %>%
  filter(a > 3) %>%
  collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! NotImplemented: Function 'greater' has no kernel matching input types
#>   (<labelled<integer>[0]>, <labelled<integer>[0]>)
```
``` r
tf <- tempfile()
write_parquet(d, tf)
# works
read_parquet(tf, as_data_frame = FALSE) %>%
  mutate(a = as.integer(a)) %>%
  filter(a > 3) %>%
  collect()
#> # A tibble: 2 × 2
#>       a         b
#>   <int> <int+lbl>
#> 1     4        14
#> 2     5        15
```
``` r
# fails
open_dataset(tf) %>%
  mutate(a = as.integer(a)) %>%
  filter(a > 3) %>%
  collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! NotImplemented: Function 'greater_equal' has no kernel matching input types
#>   (<labelled<integer>[0]>, <labelled<integer>[0]>)
```
``` r
# works, but compute() pulls the whole (mutated) dataset into memory before the filter
open_dataset(tf) %>%
  mutate(a = as.integer(a)) %>%
  compute() %>%
  filter(a > 3) %>%
  collect()
#> # A tibble: 2 × 2
#>       a         b
#>   <int> <int+lbl>
#> 1     4        14
#> 2     5        15
```