[ 
https://issues.apache.org/jira/browse/ARROW-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550987#comment-17550987
 ] 

Neal Richardson commented on ARROW-16768:
-----------------------------------------

Thanks for the report. A couple of things to note:

1. Factors can have missing values in the data. The issue is that in your 
example, you've put an NA into the "labels" argument of {{factor()}}, because 
the three values are matched by position to {{x}}, {{levels}}, and {{labels}}:

{code}
> factor(1, 2, NA)
[1] <NA>
Levels: <NA>
{code}

Assuming you meant all of the arguments passed to {{factor()}} to be data 
values, there is no problem because R puts the NA in the data and not in the 
levels:

{code}
> factor(c(1, 2, NA))
[1] 1    2    <NA>
Levels: 1 2
{code}

So {{data.frame(A = factor(c(1, 2, NA)))}} writes just fine. 
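
To make the difference between the two calls concrete, here is a minimal base R sketch. It assumes the usual {{factor(x, levels, labels = levels, ...)}} signature; the variable names are only illustrative.

```r
# factor(1, 2, NA) matches arguments by position: x = 1, levels = 2, labels = NA
f_positional <- factor(1, 2, NA)
f_named <- factor(x = 1, levels = 2, labels = NA)
identical(f_positional, f_named)

# The only level is NA, so the NA sits in the levels, not just in the data
anyNA(levels(f_positional))

# Wrapping the values in c() makes all of them data; the levels stay NA-free
f_data <- factor(c(1, 2, NA))
levels(f_data)
anyNA(levels(f_data))
```

Checking {{anyNA(levels(x))}} is a quick way to tell whether a factor column falls into the problematic case.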

2. The error comes from the conversion to Arrow types, before anything is sent 
to the Parquet writer:

{code}
> Array$create(factor(1, labels=NA))
Error: Invalid: Cannot insert dictionary values containing nulls
{code}

raised from here: 
https://github.com/apache/arrow/blob/91e3ac53e2e21736ce6291d73fc37da6fa21259d/cpp/src/arrow/array/builder_dict.cc#L81

If there is a real use case where you could get an NA in the factor levels, we 
would need to handle that in R.
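
In the meantime, one possible user-side workaround is to rebuild the factor without the NA level before converting. This is only a sketch: {{drop_na_level()}} is a hypothetical helper, not something arrow provides, and it assumes that turning the NA level back into ordinary missing values is acceptable for your data.

```r
# Hypothetical helper (not part of arrow): rebuild a factor without an NA level.
# Elements that pointed at the NA level become ordinary missing values.
drop_na_level <- function(f) {
  stopifnot(is.factor(f))
  lv <- levels(f)
  if (!anyNA(lv)) {
    return(f)  # nothing to do when the levels are already NA-free
  }
  factor(as.character(f), levels = lv[!is.na(lv)], ordered = is.ordered(f))
}

f <- addNA(factor(c("a", "b", NA)))  # addNA() turns NA into an explicit level
anyNA(levels(f))

g <- drop_na_level(f)
levels(g)
is.na(g)
```

After this, the levels no longer contain NA, which is the condition the dictionary builder rejects. Whether that is the right behavior for a given dataset is a separate question, since the distinction between "NA as a level" and "NA as a value" is lost.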

> Factor variables in R with missing values cause an error for write_parquet
> --------------------------------------------------------------------------
>
>                 Key: ARROW-16768
>                 URL: https://issues.apache.org/jira/browse/ARROW-16768
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 7.0.0
>            Reporter: Kieran Martin
>            Priority: Minor
>
> If you try to write a data frame with a factor with a missing value to 
> parquet, you get the error: "Error: Invalid: Cannot insert dictionary values 
> containing nulls". 
> This seems likely due to how the metadata for factors is currently captured 
> in parquet files. Reprex follows:
>  
> library(arrow)
> bad_data <- data.frame(A = factor(1, 2, NA))
> write_parquet(bad_data, tempfile())
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
