[
https://issues.apache.org/jira/browse/ARROW-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558090#comment-17558090
]
Neal Richardson commented on ARROW-16768:
-----------------------------------------
Some more exploration of the R behavior, which leaves me unsure of how we
should handle this other than with a better error message. Putting NA in the
levels changes the meaning of the data, so we can't just encode it back into
the data.
{code}
# Default: NA goes in the data
f <- factor(c(1, 2, NA))
f
#> [1] 1 2 <NA>
#> Levels: 1 2
is.na(f)
#> [1] FALSE FALSE TRUE
# addNA() moves it from the data to the levels
f2 <- addNA(f)
f2
#> [1] 1 2 <NA>
#> Levels: 1 2 <NA>
# This has semantic changes: NA in the levels is no longer "missing"
is.na(f2)
#> [1] FALSE FALSE FALSE
# You can see this in the underlying data
dput(f)
#> structure(c(1L, 2L, NA), levels = c("1", "2"), class = "factor")
dput(f2)
#> structure(1:3, levels = c("1", "2", NA), class = "factor")
# Can you have NAs in both data and levels?
f3 <- structure(c(1:3, NA), levels = c("1", "2", NA), class = "factor")
f3
#> [1] 1 2 <NA> <NA>
#> Levels: 1 2 <NA>
# This looks like madness
is.na(f3)
#> [1] FALSE FALSE FALSE TRUE
{code}
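If a better error message is the way to go, the check itself is cheap. Below is a minimal sketch, not arrow's actual implementation: `na_level_columns()` and `check_no_na_levels()` are hypothetical helper names, and the error wording is only a placeholder. It detects factor columns whose levels contain NA, the same condition as `f2` and `f3` above, and leaves the data untouched, since moving the NA back into the data would change its meaning.

```r
# Hypothetical helpers, not part of the arrow package.
# Return the names of factor columns whose *levels* contain NA.
na_level_columns <- function(df) {
  has_na_level <- vapply(
    df,
    function(col) is.factor(col) && anyNA(levels(col)),
    logical(1)
  )
  names(df)[has_na_level]
}

# Fail early with a message that names the offending columns.
check_no_na_levels <- function(df) {
  bad <- na_level_columns(df)
  if (length(bad) > 0) {
    stop(
      "Factor column(s) with NA in their levels: ",
      paste(bad, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(df)
}

# NA in the data only: passes the check
ok <- data.frame(A = factor(c(1, 2, NA)))
# NA moved into the levels with addNA(): flagged
bad <- data.frame(A = addNA(factor(c(1, 2, NA))))

na_level_columns(ok)
na_level_columns(bad)
```

Calling `check_no_na_levels(bad)` before `write_parquet()` would stop with the column name instead of the generic "Cannot insert dictionary values containing nulls" error.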
> [R] Factor levels cannot contain NA
> -----------------------------------
>
> Key: ARROW-16768
> URL: https://issues.apache.org/jira/browse/ARROW-16768
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 7.0.0
> Reporter: Kieran Martin
> Priority: Minor
> Fix For: 9.0.0
>
>
> If you try to write a data frame with a factor with a missing value to
> parquet, you get the error: "Error: Invalid: Cannot insert dictionary values
> containing nulls".
> This seems likely due to how the metadata for factors is currently captured
> in parquet files. Reprex follows:
>
> library(arrow)
> bad_data <- data.frame(A = factor(1, 2, NA))
> write_parquet(bad_data, tempfile())
>
>