[
https://issues.apache.org/jira/browse/ARROW-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557556#comment-17557556
]
Kieran Martin commented on ARROW-16768:
---------------------------------------
Hi [~npr], yes, you're correct that R can handle missing values in factors, and
the correct behaviour is probably that levels shouldn't normally contain
missing values. I encountered this issue when using the Admiral package
([https://github.com/pharmaverse/admiral]) with some dummy data, and noticed
that arrow couldn't handle this edge case. This Stack Overflow question
[https://stackoverflow.com/questions/27195956/convert-na-into-a-factor-level],
which refers to the function [~paleolimbot] mentioned, implies that at least
one user wanted this! You can get the same behaviour (as per the question) by
passing exclude = NULL to factor().
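
To illustrate the distinction, here is a minimal sketch in base R only (no arrow calls). The variable names are made up for the example. It shows that a missing value in the data does not by itself put NA into the levels, whereas exclude = NULL does:

```r
# Base R only; variable names are illustrative.
x <- c("a", "b", NA)

# Default: NA is excluded from the levels, although the value is still missing.
f_default <- factor(x)
levels(f_default)
anyNA(levels(f_default))   # FALSE

# exclude = NULL keeps NA as a level of its own.
f_na_level <- factor(x, exclude = NULL)
levels(f_na_level)
anyNA(levels(f_na_level))  # TRUE
```

Only the second factor has NA among its levels, which is the case this issue is about.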
I think arrow should either handle this or give a more meaningful error, so the
user can debug a little more easily (it took me quite a bit of detective work
to determine what exactly was causing the issue!)
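
In the meantime, a pre-check along these lines might shorten that detective work. This is only a sketch in base R: na_level_columns is a hypothetical helper name, not an arrow function, and it assumes the error comes from factor columns whose levels include NA.

```r
# Hypothetical helper (not part of arrow): returns the names of factor
# columns whose levels include NA.
na_level_columns <- function(df) {
  flagged <- vapply(
    df,
    function(col) is.factor(col) && anyNA(levels(col)),
    logical(1)
  )
  names(df)[flagged]
}

# Illustrative data: only column A has NA as a level.
example_data <- data.frame(
  A = factor(c("x", NA), exclude = NULL),  # NA kept as a level
  B = factor(c("x", NA)),                  # NA value, but not a level
  C = c(1, 2)                              # not a factor
)
na_level_columns(example_data)  # "A"
```

Running this before writing the data would point at the offending columns directly.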
> [R] Factor levels cannot contain NA
> -----------------------------------
>
> Key: ARROW-16768
> URL: https://issues.apache.org/jira/browse/ARROW-16768
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 7.0.0
> Reporter: Kieran Martin
> Priority: Minor
>
> If you try to write a data frame with a factor with a missing value to
> parquet, you get the error: "Error: Invalid: Cannot insert dictionary values
> containing nulls".
> This seems likely due to how the metadata for factors is currently captured
> in parquet files. Reprex follows:
>
> library(arrow)
> bad_data <- data.frame(A = factor(1, 2, NA))
> write_parquet(bad_data, tempfile())
>
>