[GitHub] [arrow] paleolimbot commented on a diff in pull request #37843: GH-37842: [R] Implement infer_schema.data.frame()

via GitHub Wed, 27 Sep 2023 07:03:53 -0700


paleolimbot commented on code in PR #37843:
URL: https://github.com/apache/arrow/pull/37843#discussion_r1338666415



##########
r/tests/testthat/test-schema.R:
##########
@@ -295,7 +295,9 @@ test_that("schema name assignment", {
 
 test_that("schema extraction", {
   skip_if_not_available("dataset")
+
   tbl <- arrow_table(example_data)
+  expect_equal(schema(example_data), tbl$schema)

Review Comment:
   I think it is probably worth testing some edge cases with `data.frame()` 
like (1) at df with duplicate names and (2) a data frame with zero columns!



##########
r/R/schema.R:
##########
@@ -285,6 +285,9 @@ infer_schema.Dataset <- function(x) x$schema
 #' @export
 infer_schema.arrow_dplyr_query <- function(x) implicit_schema(x)
 
+#' @export
+infer_schema.data.frame <- function(x) infer_schema(arrow_table(x))

Review Comment:
   `arrow_table(some_data_frame)` is potentially a very expensive operation 
(possibly requiring a full copy of the data if, for example, they are all 
string columns).
   
   Another option is `infer_schema(arrow_table(x[integer(), , drop = FALSE]))`, 
although this won't quite work for things like `list()` columns where we 
actually need some values to properly infer the type.
   
   Unless I'm missing some prior art in the package, I think `schema(!!! 
lapply(x, infer_type))` might be the way to go.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] paleolimbot commented on a diff in pull request #37843: GH-37842: [R] Implement infer_schema.data.frame()

Reply via email to