nealrichardson commented on PR #34825:
URL: https://github.com/apache/arrow/pull/34825#issuecomment-1499479934

   Apologies for being late to the discussion here, and forgive me if I'm missing something, but it seems like the odd behavior @thisisnic reported is a narrower problem with how the auto-splicing works in `arrow_table()` / `Table$create()`. That was added in #4704, and a TODO issue (#22186) was filed about improving the C++ code around it, though I don't think this particular issue was what they had in mind. This seems worth fixing.
   
   But I'm not sure there's a more general problem to solve; or rather, I'm not 
sure there's a better solution than the one we have that meets our 
requirements. It seems like the constraints we're solving for, ranked by 
priority, are:
   
   1. `as.data.frame(ArrowObject)` should return an object for which `is.data.frame()` is `TRUE` and that can be used by any other R function, with no Arrow methods required to work with it.
   2. Fidelity: when a user saves an R data object to Arrow or Parquet, they 
should get the same R object back when they load it, including metadata.
   3. Internal coherence: the same kind of operation in various places should 
yield the same kind of result.
   4. `tbl`s are useful because they print nicer than vanilla `data.frame`s.
   5. Because we want Arrow to have broad adoption and appeal, we want to avoid 
adding hard dependencies.
   
   Our current approach is: when converting a Table to a data.frame, we [set the class](https://github.com/apache/arrow/blob/main/r/src/array_to_vector.cpp#L1386) as [`c("tbl_df", "tbl", "data.frame")`](https://github.com/apache/arrow/blob/main/r/src/symbols.cpp#L60) instead of just `"data.frame"`. But [if there is R metadata](https://github.com/apache/arrow/blob/main/r/R/metadata.R#L99) saved that overwrites the class, we use that so that we can faithfully round-trip R objects. Leaving aside the auto-splicing issue, I think that's all that's happening.
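   To make the class-restoration logic concrete, here's a minimal sketch in plain R. This is illustrative only — the real work happens in the C++ conversion code and `metadata.R` linked above, and `restore_class()` is a hypothetical helper, not an arrow function:

   ```r
   # Hypothetical sketch of the class-restoration logic described above.
   # By default the converted data.frame gets the tibble classes; if saved
   # R metadata carries a class attribute, that wins so round-trips are
   # faithful.
   restore_class <- function(df, r_metadata = NULL) {
     if (!is.null(r_metadata$class)) {
       class(df) <- r_metadata$class  # fidelity (#2) takes priority
     } else {
       class(df) <- c("tbl_df", "tbl", "data.frame")  # nicer printing (#4)
     }
     df
   }

   df <- restore_class(data.frame(x = 1:3))
   class(df)  # "tbl_df" "tbl" "data.frame"

   df2 <- restore_class(data.frame(x = 1:3), list(class = "data.frame"))
   class(df2)  # "data.frame"
   ```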
   
   We're relying on how S3 dispatch works: if you don't have `tibble` loaded and print a `c("tbl_df", "tbl", "data.frame")`, it just prints like a data.frame, so we can both (a) return tbls and (b) avoid depending on `tibble`. That gets us `#4` and `#5`, but because of `#2` we sacrifice a small amount of `#3`: we satisfy the weaker `inherits(x, "data.frame")` requirement but not `identical(class(x), "data.frame")`. I'm of the opinion that this is a reasonable tradeoff since tibbles are subclasses of data.frame: you always get back a data.frame. We could push back the other way, but I think we would have to give up some `#4` as a result.
   
   To be clear, I think you/we could justify any number of tradeoffs along 
these dimensions, I just wanted to sketch out what I think they are.
   
   A few other specific thoughts:
   
   * I'm not sure you can make `as.data.frame()` always return `class == 
"data.frame"` without either sacrificing roundtrip fidelity or inventing a new 
function that means "take this Arrow Table with metadata and make an R object". 
The latter doesn't sound like something we want to do.
   * The `read_xxx()` functions can't really be strict about `class` without losing roundtrip fidelity. For me, that's enough to say `inherits(x, "data.frame")` is sufficient and we don't have to always return an identical class.
   * Seems like `dplyr::collect()` should always return tibble, regardless of 
other changes you make (I believe it already does, given how the ExecPlan 
evaluation works).
   
   Additional historical context, in case it's useful in understanding how we 
got here:
   
   * https://github.com/apache/arrow/pull/4454
   * https://github.com/apache/arrow/pull/5399
   * https://github.com/apache/arrow/issues/22715
   

