[ https://issues.apache.org/jira/browse/ARROW-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Keane updated ARROW-13189:
-----------------------------------
Summary: [R] Disable row-level metadata application on datasets (was: [R]
Should we be handling row-level metadata at all?)
> [R] Disable row-level metadata application on datasets
> ------------------------------------------------------
>
> Key: ARROW-13189
> URL: https://issues.apache.org/jira/browse/ARROW-13189
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 3.0.0, 4.0.0, 4.0.1
> Reporter: Jonathan Keane
> Assignee: Jonathan Keane
> Priority: Major
> Fix For: 5.0.0
>
>
> To support things like sf columns, we added code that handles row-level
> metadata (https://github.com/apache/arrow/pull/8549 and
> https://github.com/apache/arrow/pull/9182).
> This works fine for a single table or a single Parquet file, but with a
> dataset (even without filtering!) it can produce surprising (and wrong)
> results (see the reprex below).
> Work is already underway in ARROW-12542 to make it easier to convert
> row-element-level attributes to a struct and store it in the column, but
> that is still a bit off. Even once that is done, what should we do here:
> disable this entirely? Stop, or ignore and warn, because row-level metadata
> isn't applied with datasets (since there's no way for us to get the
> ordering right)? Something else?
> {code:r}
> library(arrow)
> df <- tibble::tibble(
>   part = rep(1:2, 13),
>   let = letters
> )
> df$embedded_attr <- lapply(seq_len(nrow(df)), function(i) {
>   value <- "nothing"
>   attributes(value) <- list(letter = df[[i, "let"]])
>   value
> })
> df_from_tab <- as.data.frame(Table$create(df))
> # this should be (and is) "b"
> attributes(df_from_tab[df_from_tab$let == "b", "embedded_attr"][[1]][[1]])
> #> $letter
> #> [1] "b"
> # the dfs are the same
> waldo::compare(df, df_from_tab)
> #> ✓ No differences
> # now via dataset
> dir <- "ds-dir"
> write_dataset(df, path = dir, partitioning = "part")
> ds <- open_dataset(dir)
> df_from_ds <- dplyr::collect(ds)
> # this should be "b" (but is not)
> attributes(df_from_ds[df_from_ds$let == "b", "embedded_attr"][[1]][[1]])
> #> $letter
> #> [1] "n"
> # Even controlling for order, the dfs are not the same
> waldo::compare(dplyr::arrange(df, let), dplyr::arrange(df_from_ds, let))
> #> `names(old)`: "part" "let" "embedded_attr"
> #> `names(new)`: "let" "embedded_attr" "part"
> #>
> #> `attr(old$embedded_attr[[2]], 'letter')`: "b"
> #> `attr(new$embedded_attr[[2]], 'letter')`: "n"
> #>
> #> `attr(old$embedded_attr[[3]], 'letter')`: "c"
> #> `attr(new$embedded_attr[[3]], 'letter')`: "b"
> #>
> #> `attr(old$embedded_attr[[4]], 'letter')`: "d"
> #> `attr(new$embedded_attr[[4]], 'letter')`: "o"
> #>
> #> `attr(old$embedded_attr[[5]], 'letter')`: "e"
> #> `attr(new$embedded_attr[[5]], 'letter')`: "c"
> #>
> #> `attr(old$embedded_attr[[6]], 'letter')`: "f"
> #> `attr(new$embedded_attr[[6]], 'letter')`: "p"
> #>
> #> `attr(old$embedded_attr[[7]], 'letter')`: "g"
> #> `attr(new$embedded_attr[[7]], 'letter')`: "d"
> #>
> #> `attr(old$embedded_attr[[8]], 'letter')`: "h"
> #> `attr(new$embedded_attr[[8]], 'letter')`: "q"
> #>
> #> `attr(old$embedded_attr[[9]], 'letter')`: "i"
> #> `attr(new$embedded_attr[[9]], 'letter')`: "e"
> #>
> #> `attr(old$embedded_attr[[10]], 'letter')`: "j"
> #> `attr(new$embedded_attr[[10]], 'letter')`: "r"
> #>
> #> And 15 more differences ...
> {code}
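
The description mentions storing row-element-level attributes in the column itself so that they travel with their rows. As a rough illustration of that idea only (not the ARROW-12542 implementation), the base R sketch below moves the attribute into an ordinary column before the data is reordered and reattaches it afterwards. The helpers `lift_attr()` and `restore_attr()` are hypothetical and not part of arrow, and the reordering by `part` is an assumed stand-in for the row order a partitioned dataset might return.

```r
# Hypothetical helpers (not part of arrow): keep a per-element attribute in an
# ordinary column so it stays with its row regardless of row order.
lift_attr <- function(df, col, attr_name, into) {
  # Copy the attribute of every element into a plain character column.
  df[[into]] <- vapply(df[[col]], function(x) attr(x, attr_name), character(1))
  # Strip the attribute from the elements themselves.
  df[[col]] <- lapply(df[[col]], function(x) {
    attr(x, attr_name) <- NULL
    x
  })
  df
}

restore_attr <- function(df, col, attr_name, from) {
  # Reattach each row's attribute from the plain column, then drop that column.
  df[[col]] <- Map(function(x, a) {
    attr(x, attr_name) <- a
    x
  }, df[[col]], df[[from]])
  df[[from]] <- NULL
  df
}

# Same shape of data as the reprex, built with base R only.
df <- data.frame(part = rep(1:2, 13), let = letters, stringsAsFactors = FALSE)
df$embedded_attr <- lapply(seq_len(nrow(df)), function(i) {
  value <- "nothing"
  attr(value, "letter") <- letters[i]
  value
})

lifted <- lift_attr(df, "embedded_attr", "letter", into = "letter_attr")

# Assumed stand-in for a dataset round trip: rows come back grouped by partition.
reordered <- lifted[order(lifted$part), ]

restored <- restore_attr(reordered, "embedded_attr", "letter", from = "letter_attr")

# The attribute now follows its row even though the row order changed.
attr(restored$embedded_attr[[which(restored$let == "b")]], "letter")
#> [1] "b"
```

In the reprex, the equivalent would be to call something like `lift_attr()` before `write_dataset()` and `restore_attr()` after `dplyr::collect()`. The trade-off is an extra column on disk in exchange for not depending on row order.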