paleolimbot commented on issue #1008:
URL: https://github.com/apache/arrow-adbc/issues/1008#issuecomment-1703614269

   We could also:
   
   - Support dictionary-encoded columns in the drivers. It's a bit of a pain for the drivers implemented in C but not impossible (basically `ArrowArrayViewGetString(array_view->dictionary, ArrowArrayViewGetInt(array_view, i))`).
   - Stop using dictionary encoding to represent factors by default when converting in nanoarrow. That would be a fairly large departure from the conversions that happen in the arrow package, and I think it would also be slower.
   - Work around this by never using dictionary encoding in adbi when passing a data frame to adbcdrivermanager.
   
   Here is an example of how you could avoid dictionary encoding without changes to nanoarrow or the driver R packages:
   
   ``` r
   library(nanoarrow)
   
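   # Recursively replace every dictionary-encoded schema with the schema of
   # its dictionary values (e.g., dictionary(int32)<string> becomes string)
   without_dictionary_encoding <- function(x) {
     # Handle nested types (e.g., struct children) first
     x$children <- lapply(x$children, without_dictionary_encoding)
     
     if (is.null(x$dictionary)) {
       x
     } else {
       x$dictionary
     }
   }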
   
   df <- palmerpenguins::penguins[1]
   (schema <- infer_nanoarrow_schema(df))
   #> <nanoarrow_schema struct>
   #>  $ format    : chr "+s"
   #>  $ name      : chr ""
   #>  $ metadata  : list()
   #>  $ flags     : int 0
   #>  $ children  :List of 1
   #>   ..$ species:<nanoarrow_schema dictionary(int32)<string>>
   #>   .. ..$ format    : chr "i"
   #>   .. ..$ name      : chr "species"
   #>   .. ..$ metadata  : list()
   #>   .. ..$ flags     : int 2
   #>   .. ..$ children  : list()
   #>   .. ..$ dictionary:<nanoarrow_schema string>
   #>   .. .. ..$ format    : chr "u"
   #>   .. .. ..$ name      : chr ""
   #>   .. .. ..$ metadata  : list()
   #>   .. .. ..$ flags     : int 2
   #>   .. .. ..$ children  : list()
   #>   .. .. ..$ dictionary: NULL
   #>  $ dictionary: NULL
   as_nanoarrow_array(df)
   #> <nanoarrow_array struct[344]>
   #>  $ length    : int 344
   #>  $ null_count: int 0
   #>  $ offset    : int 0
   #>  $ buffers   :List of 1
   #>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
   #>  $ children  :List of 1
   #>   ..$ species:<nanoarrow_array dictionary(int32)<string>[344]>
   #>   .. ..$ length    : int 344
   #>   .. ..$ null_count: int 0
   #>   .. ..$ offset    : int 0
   #>   .. ..$ buffers   :List of 2
   #>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
   #>   .. .. ..$ :<nanoarrow_buffer data<int32>[344][1376 b]> `0 0 0 0 0 0 0 0 0 ...`
   #>   .. ..$ dictionary:<nanoarrow_array string[3]>
   #>   .. .. ..$ length    : int 3
   #>   .. .. ..$ null_count: int 0
   #>   .. .. ..$ offset    : int 0
   #>   .. .. ..$ buffers   :List of 3
   #>   .. .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
   #>   .. .. .. ..$ :<nanoarrow_buffer data_offset<int32>[4][16 b]> `0 6 15 21`
   #>   .. .. .. ..$ :<nanoarrow_buffer data<string>[21 b]> `AdelieChinstrapGentoo`
   #>   .. .. ..$ dictionary: NULL
   #>   .. .. ..$ children  : list()
   #>   .. ..$ children  : list()
   #>  $ dictionary: NULL
   as_nanoarrow_array(df, schema = without_dictionary_encoding(schema))
   #> <nanoarrow_array struct[344]>
   #>  $ length    : int 344
   #>  $ null_count: int 0
   #>  $ offset    : int 0
   #>  $ buffers   :List of 1
   #>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
   #>  $ children  :List of 1
   #>   ..$ species:<nanoarrow_array string[344]>
   #>   .. ..$ length    : int 344
   #>   .. ..$ null_count: int 0
   #>   .. ..$ offset    : int 0
   #>   .. ..$ buffers   :List of 3
   #>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
   #>   .. .. ..$ :<nanoarrow_buffer data_offset<int32>[345][1380 b]> `0 6 12 18 2...`
   #>   .. .. ..$ :<nanoarrow_buffer data<string>[2268 b]> `AdelieAdelieAdelieAdel...`
   #>   .. ..$ dictionary: NULL
   #>   .. ..$ children  : list()
   #>  $ dictionary: NULL
   ```
   
   <sup>Created on 2023-09-01 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
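
A rougher variant of the same workaround, sketched here in base R only: convert factor columns to character before the data frame ever reaches nanoarrow, so that no dictionary-encoded schema is inferred in the first place. The helper name `factors_as_character()` is hypothetical (it is not part of nanoarrow, adbi, or adbcdrivermanager), and it only looks at top-level columns, not at nested data frame columns.

```r
# Hypothetical helper: replace every top-level factor column with its
# character representation; all other columns are returned unchanged.
factors_as_character <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) as.character(col) else col
  })
  df
}

df <- data.frame(
  species = factor(c("Adelie", "Gentoo", NA)),
  mass = c(1, 2, 3)
)

str(factors_as_character(df))
```

Passing the result to `as_nanoarrow_array()` should then give a plain string column for `species`. As with the schema-based approach, the factor levels themselves are not preserved.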

