[ 
https://issues.apache.org/jira/browse/ARROW-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558090#comment-17558090
 ] 

Neal Richardson commented on ARROW-16768:
-----------------------------------------

Some more exploration of the R behavior, which leaves me unsure of how we 
should handle this other than with a better error message. Putting NA in the 
levels changes the meaning of the data, so we can't just encode it back into 
the data.

{code}
# Default: NA goes in the data
f <- factor(c(1, 2, NA))
f
#> [1] 1    2    <NA>
#> Levels: 1 2
is.na(f)
#> [1] FALSE FALSE  TRUE

# addNA() moves it from the data to the levels
f2 <- addNA(f)
f2
#> [1] 1    2    <NA>
#> Levels: 1 2 <NA>
# This has semantic changes: NA in the levels is no longer "missing"
is.na(f2)
#> [1] FALSE FALSE FALSE

# You can see this in the underlying data
dput(f)
#> structure(c(1L, 2L, NA), levels = c("1", "2"), class = "factor")
dput(f2)
#> structure(1:3, levels = c("1", "2", NA), class = "factor")

# Can you have NAs in both data and levels?
f3 <- structure(c(1:3, NA), levels = c("1", "2", NA), class = "factor")
f3
#> [1] 1    2    <NA> <NA>
#> Levels: 1 2 <NA>
# This looks like madness
is.na(f3)
#> [1] FALSE FALSE FALSE  TRUE
{code}
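
For anyone hitting this before a better error message lands, here is a rough user-side sketch of the two ways out, using base R only. The helper name has_na_level() and the "(Missing)" label are made up for illustration and are not part of arrow. Option 2 is exactly the semantic change described above, so it is only appropriate if you really do want those values treated as missing. The Parquet round trip itself is not verified here.

{code}
# Hypothetical helper (not an arrow function): TRUE if a factor has NA among its levels
has_na_level <- function(x) is.factor(x) && anyNA(levels(x))

f  <- factor(c(1, 2, NA))
f2 <- addNA(f)
has_na_level(f)
#> [1] FALSE
has_na_level(f2)
#> [1] TRUE

# Option 1: keep the "not missing" meaning by giving the NA level a real label.
# "(Missing)" is an arbitrary placeholder chosen for this sketch.
f2_labelled <- f2
levels(f2_labelled)[is.na(levels(f2_labelled))] <- "(Missing)"

# Option 2: accept the semantic change and turn the NA level back into
# missing data by rebuilding the factor without NA in its levels.
f2_missing <- factor(f2, levels = levels(f2)[!is.na(levels(f2))])
{code}

After either option the factor no longer has NA among its levels, which is the condition the "dictionary values containing nulls" error points at.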

> [R] Factor levels cannot contain NA
> -----------------------------------
>
>                 Key: ARROW-16768
>                 URL: https://issues.apache.org/jira/browse/ARROW-16768
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 7.0.0
>            Reporter: Kieran Martin
>            Priority: Minor
>             Fix For: 9.0.0
>
>
> If you try to write a data frame with a factor with a missing value to 
> parquet, you get the error: "Error: Invalid: Cannot insert dictionary values 
> containing nulls". 
> This seems likely due to how the metadata for factors is currently captured 
> in parquet files. Reprex follows:
>  
> library(arrow)
> bad_data <- data.frame(A = factor(1, 2, NA))
> write_parquet(bad_data, tempfile())
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
