paleolimbot commented on issue #513:
URL: https://github.com/apache/arrow-nanoarrow/issues/513#issuecomment-2157031383

   Thanks for bringing this up!
   
   One of the tricky things about dictionaries in Arrow is that the 
"levels"/"dictionary" live at the array level, not at the type level. This 
means that two arrays can both be a `dictionary(int32, string)` while each has 
its own dictionary. Arrow C++ (and therefore arrow R) handles this with a 
rather complex system of "dictionary unification", which it can do because it 
has equality kernels available. nanoarrow doesn't have any of that, so I made 
the default conversion a little simpler: it returns the dictionary value type, 
which handles dictionaries of things that aren't just strings in a more 
predictable way (or at least a more stable one, even if unexpected to the 
average R user).
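   
   As a rough base R analogy (plain factors here, not nanoarrow or Arrow 
arrays), two chunks can share a type and even hold identical integer codes 
while meaning different things, because each carries its own levels:
   
   ``` r
   # Two chunks of the same "type" (factor), each with its own dictionary
   chunk1 <- factor(c("a", "b", "a"))
   chunk2 <- factor(c("c", "d", "c"))
   
   # The integer codes are identical...
   identical(as.integer(chunk1), as.integer(chunk2))
   #> [1] TRUE
   
   # ...but the dictionaries (levels) they index into are not
   identical(levels(chunk1), levels(chunk2))
   #> [1] FALSE
   ```
   
   Concatenating the codes without reconciling the levels first would silently 
mix up the values, which is the problem dictionary unification solves.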
   
   You should be able to specify that you want a `factor()` specifically, and 
this will work for converting just one batch. If you need to convert an 
arbitrary stream, you'll need to know the levels in advance at the moment (this 
could be fixed such that it "learns" the levels as it goes and finalizes the 
array at the end...basically an implementation of dictionary unification 
written in R).
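   
   A minimal sketch of what "learning" the levels could look like, written in 
base R against plain factors rather than nanoarrow arrays. The helper 
`new_level_learner()` and its `push()`/`finish()` functions are hypothetical 
names for illustration only and are not part of nanoarrow:
   
   ``` r
   # Hypothetical helper (not a nanoarrow API): accumulates factor chunks,
   # growing the set of levels as previously unseen values appear.
   new_level_learner <- function() {
     levels_seen <- character()
     codes <- list()
   
     list(
       push = function(chunk) {
         # New levels are appended, so positions of earlier levels never
         # change and codes stored for earlier chunks stay valid
         levels_seen <<- union(levels_seen, levels(chunk))
   
         # Remap this chunk's codes into the accumulated dictionary
         lookup <- match(levels(chunk), levels_seen)
         codes[[length(codes) + 1L]] <<- lookup[as.integer(chunk)]
         invisible(NULL)
       },
       finish = function() {
         # Finalize: one factor over all chunks with the unified levels
         structure(
           as.integer(unlist(codes)),
           levels = levels_seen,
           class = "factor"
         )
       }
     )
   }
   
   learner <- new_level_learner()
   learner$push(factor(letters[1:5]))
   learner$push(factor(letters[6:10]))
   result <- learner$finish()
   ```
   
   Here `result` is a single factor of length 10 whose levels are the union of 
both chunks' levels in order of first appearance. A real implementation would 
have to do the equivalent on the dictionary arrays themselves.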
   
   ``` r
   library(nanoarrow)
   #> Warning: package 'nanoarrow' was built under R version 4.3.3
   
   df1 <- data.frame(
     x = as.factor(letters[1:5]),
     y = as.factor(1:5)
   )
   
   df2 <- data.frame(
     x = as.factor(letters[6:10]),
     y = as.factor(1:5)
   )
   
   # Safest/most type stable/makes the fewest assumptions to just return
   # the dictionary value type
   basic_array_stream(list(df1, df2)) |> 
     convert_array_stream() |> 
     tibble::as_tibble()
   #> # A tibble: 10 × 2
   #>    x     y    
   #>    <chr> <chr>
   #>  1 a     1    
   #>  2 b     2    
   #>  3 c     3    
   #>  4 d     4    
   #>  5 e     5    
   #>  6 f     1    
   #>  7 g     2    
   #>  8 h     3    
   #>  9 i     4    
   #> 10 j     5
   
   # You can specify a factor() target type if you know the levels
   basic_array_stream(list(df1, df2)) |> 
     convert_array_stream(
       data.frame(
         x = factor(levels = letters),
         y = factor(levels = as.character(1:5))
       )
     ) |> 
     tibble::as_tibble()
   #> # A tibble: 10 × 2
   #>    x     y    
   #>    <fct> <fct>
   #>  1 a     1    
   #>  2 b     2    
   #>  3 c     3    
   #>  4 d     4    
   #>  5 e     5    
   #>  6 f     1    
   #>  7 g     2    
   #>  8 h     3    
   #>  9 i     4    
   #> 10 j     5
   
   # If you have only one batch, factor() should work as a target (but doesn't
   # currently)
   basic_array_stream(list(df1)) |> 
     convert_array_stream(
       data.frame(x = factor(), y = factor())
     ) |> 
     tibble::as_tibble()
   #> # A tibble: 5 × 2
   #>   x     y    
   #>   <fct> <fct>
   #> 1 a     1    
   #> 2 b     2    
   #> 3 c     3    
   #> 4 d     4    
   #> 5 e     5
   ```
   
   <sup>Created on 2024-06-09 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup>

