Jonathan Keane created ARROW-13189:
--------------------------------------
Summary: [R] Should we be handling row-level metadata at all?
Key: ARROW-13189
URL: https://issues.apache.org/jira/browse/ARROW-13189
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 4.0.1, 4.0.0, 3.0.0
Reporter: Jonathan Keane
To support things like sf columns, we added code that handles row-level
metadata (https://github.com/apache/arrow/pull/8549 and
https://github.com/apache/arrow/pull/9182).
This works fine for a single table or a single Parquet file, but with a
dataset (even without filtering!) it can produce surprising and wrong
results (see reprex below).
Work is already underway in ARROW-12542 to make it easier to convert
row-element-level attributes to a struct and store it in the column, but
that's still a bit off. Even once that's done, should we disable row-level
metadata handling entirely? Or, for datasets, should we stop, or ignore and
warn that row-level metadata isn't applied (since there's no way for us to get
the ordering right)? Something else?
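As a rough illustration of the ignore+warn option, a check along these lines
could flag list columns whose elements carry their own attributes. This is only
a sketch in base R: has_row_level_attrs() and warn_row_level_metadata() are
hypothetical names, not existing arrow functions, and the check treats any
element attribute (including names) as row-level metadata.
{code:r}
# Hypothetical helper: TRUE if `col` is a list column in which at least one
# element carries attributes of its own.
has_row_level_attrs <- function(col) {
  is.list(col) &&
    any(vapply(col, function(x) !is.null(attributes(x)), logical(1)))
}

# Hypothetical helper: warn about every column with row-level attributes and
# return the names of those columns invisibly.
warn_row_level_metadata <- function(df) {
  flagged <- names(df)[vapply(df, has_row_level_attrs, logical(1))]
  if (length(flagged) > 0) {
    warning(
      "Row-level metadata in column(s) ", paste(flagged, collapse = ", "),
      " is not applied when reading from a dataset.",
      call. = FALSE
    )
  }
  invisible(flagged)
}
{code}
Where exactly such a check would live (at write time, at collect time, or both)
is part of the open question.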
{code:r}
library(arrow)
df <- tibble::tibble(
  part = rep(1:2, 13),
  let = letters
)
df$embedded_attr <- lapply(seq_len(nrow(df)), function(i) {
  value <- "nothing"
  attributes(value) <- list(letter = df[[i, "let"]])
  value
})
df_from_tab <- as.data.frame(Table$create(df))
# this should be (and is) "b"
attributes(df_from_tab[df_from_tab$let == "b", "embedded_attr"][[1]][[1]])
#> $letter
#> [1] "b"
# the dfs are the same
waldo::compare(df, df_from_tab)
#> ✓ No differences
# now via dataset
dir <- "ds-dir"
write_dataset(df, path = dir, partitioning = "part")
ds <- open_dataset(dir)
df_from_ds <- dplyr::collect(ds)
# this should be "b" (but is not)
attributes(df_from_ds[df_from_ds$let == "b", "embedded_attr"][[1]][[1]])
#> $letter
#> [1] "n"
# Even controlling for order, the dfs are not the same
waldo::compare(dplyr::arrange(df, let), dplyr::arrange(df_from_ds, let))
#> `names(old)`: "part" "let" "embedded_attr"
#> `names(new)`: "let" "embedded_attr" "part"
#>
#> `attr(old$embedded_attr[[2]], 'letter')`: "b"
#> `attr(new$embedded_attr[[2]], 'letter')`: "n"
#>
#> `attr(old$embedded_attr[[3]], 'letter')`: "c"
#> `attr(new$embedded_attr[[3]], 'letter')`: "b"
#>
#> `attr(old$embedded_attr[[4]], 'letter')`: "d"
#> `attr(new$embedded_attr[[4]], 'letter')`: "o"
#>
#> `attr(old$embedded_attr[[5]], 'letter')`: "e"
#> `attr(new$embedded_attr[[5]], 'letter')`: "c"
#>
#> `attr(old$embedded_attr[[6]], 'letter')`: "f"
#> `attr(new$embedded_attr[[6]], 'letter')`: "p"
#>
#> `attr(old$embedded_attr[[7]], 'letter')`: "g"
#> `attr(new$embedded_attr[[7]], 'letter')`: "d"
#>
#> `attr(old$embedded_attr[[8]], 'letter')`: "h"
#> `attr(new$embedded_attr[[8]], 'letter')`: "q"
#>
#> `attr(old$embedded_attr[[9]], 'letter')`: "i"
#> `attr(new$embedded_attr[[9]], 'letter')`: "e"
#>
#> `attr(old$embedded_attr[[10]], 'letter')`: "j"
#> `attr(new$embedded_attr[[10]], 'letter')`: "r"
#>
#> And 15 more differences ...
{code}
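Until something like ARROW-12542 lands, one possible user-side workaround is to
move the attribute into an ordinary column before writing, so it travels with
its row, and rebuild it after reading. The sketch below uses base R only and
simulates the reordering with order() rather than an actual dataset round trip.
flatten_attr() and restore_attr() are hypothetical helpers, and it assumes a
single string attribute per element, as in the reprex above.
{code:r}
# Hypothetical helper: copy attribute `attr_name` of each element of list
# column `col` into a plain character column, and strip it from the values.
flatten_attr <- function(df, col, attr_name) {
  out_col <- paste0(col, "_", attr_name)
  df[[out_col]] <- vapply(df[[col]], function(x) attr(x, attr_name), character(1))
  df[[col]] <- vapply(df[[col]], function(x) as.vector(x), character(1))
  df
}

# Hypothetical helper: rebuild the list column, row by row, from the plain
# value column and the attribute column, then drop the attribute column.
restore_attr <- function(df, col, attr_name) {
  in_col <- paste0(col, "_", attr_name)
  df[[col]] <- lapply(seq_len(nrow(df)), function(i) {
    value <- df[[col]][[i]]
    attr(value, attr_name) <- df[[in_col]][[i]]
    value
  })
  df[[in_col]] <- NULL
  df
}

df <- data.frame(part = rep(1:2, 13), let = letters)
df$embedded_attr <- lapply(seq_len(nrow(df)), function(i) {
  structure("nothing", letter = df$let[[i]])
})

flat <- flatten_attr(df, "embedded_attr", "letter")
# Stand-in for a dataset read that returns rows grouped by partition
reordered <- flat[order(flat$part), ]
restored <- restore_attr(reordered, "embedded_attr", "letter")

# The attribute was reordered together with its row, so it still matches `let`
attr(restored$embedded_attr[restored$let == "b"][[1]], "letter")
{code}
This does not answer whether arrow itself should keep handling row-level
metadata; it only shows that the attribute survives reordering once it is
stored as regular column data.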