[ https://issues.apache.org/jira/browse/ARROW-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550987#comment-17550987 ]
Neal Richardson commented on ARROW-16768:
-----------------------------------------
Thanks for the report. A couple of things to note:
1. Factors can have missing values in the data. The issue is that in your
example, the arguments to {{factor()}} are matched positionally as {{x}},
{{levels}}, and {{labels}}, so you've put an NA into the "labels" argument
of {{factor()}}.
{code}
> factor(1, 2, NA)
[1] <NA>
Levels: <NA>
{code}
Assuming you meant all of the arguments passed to {{factor()}} to be data
values, there is no problem because R puts the NA in the data and not in the
levels:
{code}
> factor(c(1, 2, NA))
[1] 1 2 <NA>
Levels: 1 2
{code}
So {{data.frame(A = factor(c(1, 2, NA)))}} writes just fine.
2. The error comes from the conversion to Arrow types, before anything is
sent to the Parquet writer:
{code}
> Array$create(factor(1, labels=NA))
Error: Invalid: Cannot insert dictionary values containing nulls
{code}
raised from here:
https://github.com/apache/arrow/blob/91e3ac53e2e21736ce6291d73fc37da6fa21259d/cpp/src/arrow/array/builder_dict.cc#L81
If there is a real use case where you could get an NA in the factor levels, we
would need to handle that in R.
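
In the meantime, a user-side workaround can be sketched in base R. This is only
an illustration, not something the arrow package provides: the helper names
{{has_na_level()}} and {{na_level_to_missing()}} are hypothetical, and the sketch
assumes it is acceptable to turn an NA level into ordinary missing values in the
data, which is the form that point 1 shows is handled.

```r
# Hypothetical helpers (not part of arrow): detect and repair NA factor levels.

# TRUE if NA appears among the levels (as opposed to among the data values).
has_na_level <- function(f) {
  stopifnot(is.factor(f))
  anyNA(levels(f))
}

# Re-encode the factor so that NA is dropped from the levels. Elements that
# pointed at the NA level become missing values in the data instead.
na_level_to_missing <- function(f) {
  stopifnot(is.factor(f))
  factor(f, exclude = NA)
}

ok  <- factor(c("a", "b", NA))                  # NA is in the data
bad <- factor(c("a", "b", NA), exclude = NULL)  # NA is kept as a level

has_na_level(ok)   # FALSE
has_na_level(bad)  # TRUE

fixed <- na_level_to_missing(bad)
levels(fixed)      # "a" "b"
is.na(fixed)       # FALSE FALSE TRUE
```

Applying {{na_level_to_missing()}} to the affected columns before calling
{{write_parquet()}} should avoid passing an NA level to the Arrow conversion,
at the cost of no longer distinguishing an explicit NA level from a missing
value.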
> Factor variables in R with missing values cause an error for write_parquet
> --------------------------------------------------------------------------
>
> Key: ARROW-16768
> URL: https://issues.apache.org/jira/browse/ARROW-16768
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 7.0.0
> Reporter: Kieran Martin
> Priority: Minor
>
> If you try to write a data frame with a factor with a missing value to
> parquet, you get the error: "Error: Invalid: Cannot insert dictionary values
> containing nulls".
> This seems likely due to how the metadata for factors is currently captured
> in parquet files. Reprex follows:
>
> library(arrow)
> bad_data <- data.frame(A = factor(1, 2, NA))
> write_parquet(bad_data, tempfile())
>
>