[ 
https://issues.apache.org/jira/browse/ARROW-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557556#comment-17557556
 ] 

Kieran Martin commented on ARROW-16768:
---------------------------------------

Hi [~npr], yes, you're correct that R can handle missing values in factors, and 
the correct behaviour is probably that levels shouldn't normally contain 
missing values. I encountered this issue when using the Admiral package 
(https://github.com/pharmaverse/admiral) with some dummy data, and noticed 
that arrow couldn't handle this edge case. I see from this Stack Overflow 
question 
[https://stackoverflow.com/questions/27195956/convert-na-into-a-factor-level], 
which refers to the function [~paleolimbot] mentioned, that at least one user 
wanted this! You can get the same behaviour (as per the question) by passing 
exclude = NULL to factor().
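
For illustration, here is a minimal base R sketch of how NA can end up as a 
factor level. The example vector is made up, and addNA() is shown only as 
another base R route to the same result; whether it is the function referred 
to above is an assumption on my part.

```r
# Illustrative data only (not from the original report)
x <- c("a", "b", NA)

# Default behaviour: NA is treated as a missing value, not as a level
f_default <- factor(x)
levels(f_default)

# exclude = NULL stops factor() from dropping NA, so NA becomes a level
f_na_level <- factor(x, exclude = NULL)
levels(f_na_level)

# addNA() appends NA as a level to an existing factor
f_added <- addNA(f_default)
levels(f_added)

# Quick check for whether a factor's levels contain NA
anyNA(levels(f_default))
anyNA(levels(f_na_level))
```

A factor built like f_na_level is the kind of input that, as far as I can 
tell, triggers the error described in the issue.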

I think arrow should either handle this or raise a more meaningful error, so 
that the user can debug it more easily (it took me quite a bit of detective 
work to determine what exactly was causing the issue!).
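
In the meantime, something like the sketch below could shorten the detective 
work. It is a hypothetical helper, not part of arrow, and it only assumes that 
the problem columns are factors whose levels contain NA.

```r
# Hypothetical helper (not an arrow function): return the names of factor
# columns in a data frame whose levels include NA.
factor_cols_with_na_level <- function(df) {
  has_na_level <- vapply(
    df,
    function(col) is.factor(col) && anyNA(levels(col)),
    logical(1)
  )
  names(df)[has_na_level]
}

# Illustrative data: "ok" has a missing value but no NA level,
# "bad" has NA as an explicit level.
example_df <- data.frame(
  ok  = factor(c("a", "b", NA)),
  bad = factor(c("a", "b", NA), exclude = NULL)
)

flagged <- factor_cols_with_na_level(example_df)
flagged
```

Running this on a data frame before calling write_parquet() should point to 
the columns worth inspecting.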

> [R] Factor levels cannot contain NA
> -----------------------------------
>
>                 Key: ARROW-16768
>                 URL: https://issues.apache.org/jira/browse/ARROW-16768
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 7.0.0
>            Reporter: Kieran Martin
>            Priority: Minor
>
> If you try to write a data frame with a factor with a missing value to 
> parquet, you get the error: "Error: Invalid: Cannot insert dictionary values 
> containing nulls". 
> This seems likely due to how the metadata for factors is currently captured 
> in parquet files. Reprex follows:
>  
> library(arrow)
> bad_data <- data.frame(A = factor(1, 2, NA))
> write_parquet(bad_data, tempfile())
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
