paleolimbot commented on issue #513:
URL:
https://github.com/apache/arrow-nanoarrow/issues/513#issuecomment-2157031383
Thanks for bringing this up!
One of the tricky things about dictionaries in Arrow is that the
"levels"/"dictionary" live at the array level, not at the type level. This
means that two arrays can both be a `dictionary(int32, string)` while each has
its own dictionary. Arrow C++ (and therefore arrow R) handles this with a rather
complex system of "dictionary unification", which it can do because it has
equality kernels and can do fancy things. nanoarrow doesn't have any of that,
so I made the default conversion a little simpler: it returns the dictionary
value type. This also handles dictionaries of things that aren't just strings
in a more predictable way (or at least a more stable one, even if it is
unexpected to the average R user).
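
As a rough base R analogy for the array-level dictionary point (an illustration only, using plain factors rather than nanoarrow arrays), two factors can share the same integer codes while meaning different things, because each carries its own levels:

``` r
# Two factors with the same class but different level sets
f1 <- factor(c("a", "b"))
f2 <- factor(c("b", "c"))

# The integer codes are identical in both...
as.integer(f1)
#> [1] 1 2
as.integer(f2)
#> [1] 1 2

# ...but they only mean something alongside each factor's own levels
levels(f1)
#> [1] "a" "b"
levels(f2)
#> [1] "b" "c"
```
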
You should be able to specify that you want a `factor()` specifically, and
this will work for converting just one batch. If you need to convert an
arbitrary stream, for now you'll need to know the levels in advance. (This
could be fixed so that the conversion "learns" the levels as it goes and
finalizes the array at the end, which is basically an implementation of
dictionary unification written in R.)
``` r
library(nanoarrow)
#> Warning: package 'nanoarrow' was built under R version 4.3.3
df1 <- data.frame(
  x = as.factor(letters[1:5]),
  y = as.factor(1:5)
)
df2 <- data.frame(
  x = as.factor(letters[6:10]),
  y = as.factor(1:5)
)
# Safest/most type stable/makes the fewest assumptions to just return
# the dictionary value type
basic_array_stream(list(df1, df2)) |>
  convert_array_stream() |>
  tibble::as_tibble()
#> # A tibble: 10 × 2
#>    x     y
#>    <chr> <chr>
#>  1 a     1
#>  2 b     2
#>  3 c     3
#>  4 d     4
#>  5 e     5
#>  6 f     1
#>  7 g     2
#>  8 h     3
#>  9 i     4
#> 10 j     5
# You can specify a factor() target type if you know the levels
basic_array_stream(list(df1, df2)) |>
  convert_array_stream(
    data.frame(
      x = factor(levels = letters),
      y = factor(levels = as.character(1:5))
    )
  ) |>
  tibble::as_tibble()
#> # A tibble: 10 × 2
#>    x     y
#>    <fct> <fct>
#>  1 a     1
#>  2 b     2
#>  3 c     3
#>  4 d     4
#>  5 e     5
#>  6 f     1
#>  7 g     2
#>  8 h     3
#>  9 i     4
#> 10 j     5
# If you have only one batch, factor() should work as a target (but doesn't
# currently)
basic_array_stream(list(df1)) |>
  convert_array_stream(
    data.frame(x = factor(), y = factor())
  ) |>
  tibble::as_tibble()
#> # A tibble: 5 × 2
#>   x     y
#>   <fct> <fct>
#> 1 a     1
#> 2 b     2
#> 3 c     3
#> 4 d     4
#> 5 e     5
```
<sup>Created on 2024-06-09 with [reprex
v2.1.0](https://reprex.tidyverse.org)</sup>
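
For what it's worth, the "learn the levels as it goes" idea could look roughly like the base R sketch below. This is not nanoarrow code: `unify_factor_batches()` is a hypothetical helper name, and it assumes each batch has already been converted to an R factor on its own. It keeps a running set of levels, remaps each batch's integer codes into that set, and only builds the final factor at the end.

``` r
# Hypothetical helper (not part of nanoarrow): combine factor batches whose
# level sets differ by unifying the levels in order of first appearance.
unify_factor_batches <- function(batches) {
  levels_seen <- character()
  codes <- vector("list", length(batches))

  for (i in seq_along(batches)) {
    batch <- batches[[i]]

    # "Learn" any levels this batch introduces
    levels_seen <- c(levels_seen, setdiff(levels(batch), levels_seen))

    # Map this batch's dictionary positions to positions in the unified
    # levels, then remap the batch's integer codes through that mapping
    remap <- match(levels(batch), levels_seen)
    codes[[i]] <- remap[as.integer(batch)]
  }

  # Finalize: build one factor over the unified levels
  factor(levels_seen[unlist(codes)], levels = levels_seen)
}

unified <- unify_factor_batches(
  list(factor(letters[1:5]), factor(letters[6:10]))
)
levels(unified)
```

Levels end up in order of first appearance across batches, which is a design choice of this sketch; a real implementation might prefer sorted levels or would need to handle non-string dictionary values.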