nealrichardson commented on a change in pull request #10780:
URL: https://github.com/apache/arrow/pull/10780#discussion_r674976571



##########
File path: r/tests/testthat/test-duckdb.R
##########
@@ -0,0 +1,192 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+skip_if_not_installed("duckdb")
+library(duckdb)
+library(dplyr)
+
+con <- dbConnect(duckdb::duckdb())
+# we always want to test in parallel
+dbExecute(con, "PRAGMA threads=2")
+
+test_that("basic integration", {

Review comment:
       Isn't this testing only functions in `duckdb`, not `arrow`?

##########
File path: r/R/alchemize.R
##########
@@ -0,0 +1,81 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' Transform a data structure from one engine to another
+#'
+#' The `alchemize_*` family of functions takes data in one context (e.g. Arrow
+#' data in an R session, Arrow data in a Python session) and transforms it into a
+#' form usable by another context (e.g. Arrow data in a Python session, a
+#' (virtual) table in a DuckDB session). All of these functions use Arrow's
+#' C-interface, and data is not serialized or moved when it is alchemized;
+#' instead it is made available for a subprocess of the new context (e.g. Python
+#' through reticulate or the DuckDB engine).
+#'
+#' The return value of each function in the family depends on the suffix of
+#' the function name:
+#'
+#' * `alchemize_to_duckdb` - returns a dbplyr-based `tbl` with the Arrow data
+#' registered as a (virtual) table in DuckDB. The `tbl` can be used in dplyr
+#' pipelines, or you can write DuckDB queries using the table name (by default
+#' `"arrow_"` with numbers following it) given in the `tbl`. If you would like
+#' to use a specific, pre-existing connection to DuckDB, use the `con` argument
+#' to pass the connection to use. By default, these tables are automatically
+#' cleaned up when the `tbl` is removed from the session (and garbage collection
+#' occurs on that). To disable this, pass `auto_disconnect = FALSE`.
+#' * `alchemize_to_python` - returns a reticulate-based Python object. This is the
+#' same as the interface using the `r_to_py` functions.
+#'
+#' @param x the object to alchemize
+#' @param ... arguments passed to other functions
+#'
+#' @return An object with a reference to the alchemized data
+#'
+#' @keywords internal
+#' @name alchemize
+NULL
+
+#' @rdname alchemize
+#' @export
+alchemize_to_duckdb <- function(x, ...) {
+  UseMethod("alchemize_to_duckdb")
+}
+
+#' @rdname alchemize
+#' @export
+alchemize_to_python <- function(x, ...) {
+  UseMethod("alchemize_to_python")
+}
+
+#' @include python.R
+#' @rdname alchemize
+#' @export
+alchemize_to_python.Dataset <- alchemize_to_python.arrow_dplyr_query <- function(x, ...) {

Review comment:
       You could define `r_to_py.Dataset` and `py_to_r.Dataset` like this, 
going through the RecordBatchReader (if/when that were useful, which it doesn't 
seem like it is for this issue).

##########
File path: r/R/alchemize.R
##########
@@ -0,0 +1,81 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' Transform a data structure from one engine to another

Review comment:
       I see what you're trying to do here but I'm not convinced that we need 
it (yet, at least). 
   
   What if we instead defined `tbl.arrow_dplyr_query` et al. (`tbl` is a generic in `dplyr`), so that the tests would look like:
   
   ```r
   ds %>%
     select(int, lgl) %>%
     tbl(.engine = "duckdb") %>%
     summarize(...)
   ```
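   
   A minimal sketch of how such a method might dispatch, under loud assumptions: the `tbl` generic here is a local stand-in for `dplyr::tbl` (so the snippet runs without dplyr), and `to_duckdb_stub()` and the handling of `.engine` are hypothetical placeholders, not code from this PR.
   
   ```r
   # Local stand-in for dplyr::tbl(), an S3 generic with signature (src, ...).
   tbl <- function(src, ...) UseMethod("tbl")
   
   # Hypothetical placeholder for whatever would register Arrow data with DuckDB.
   to_duckdb_stub <- function(x) {
     structure(list(source = x), class = "tbl_duckdb_stub")
   }
   
   # Method for arrow_dplyr_query: choose the target engine by name.
   # match.arg() rejects any engine that is not listed in the default.
   tbl.arrow_dplyr_query <- function(src, ..., .engine = c("duckdb")) {
     .engine <- match.arg(.engine)
     switch(.engine,
       duckdb = to_duckdb_stub(src)
     )
   }
   
   # A fake query object, used only to exercise the dispatch.
   query <- structure(list(), class = "arrow_dplyr_query")
   result <- tbl(query, .engine = "duckdb")
   class(result)
   ```
   
   In the package itself the method would presumably be registered against the real `dplyr::tbl` generic rather than a locally defined one, and further engines could be added as extra `switch()` branches.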
   

##########
File path: r/R/python.R
##########
@@ -15,6 +15,19 @@
 # specific language governing permissions and limitations
 # under the License.
 
+#' Transfer an Arrow memory structure to Python
+#'
+#' @param x the R-based Arrow object to export to Python
+#' @param ... arguments passed to the methods
+#'
+#' @return a reticulate-based Python object
+#'
+#' @keywords internal
+#' @export
+r_to_py <- function(x, ...) {

Review comment:
       I'm not sure this is a good idea because the `r_to_py` generic is in 
`reticulate`

##########
File path: r/R/duckdb.R
##########
@@ -0,0 +1,91 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+arrow_duck_connection <- function() {
+  con <- getOption("arrow_duck_con")
+  if (is.null(con) || !DBI::dbIsValid(con)) {
+    con <- DBI::dbConnect(duckdb::duckdb())
+    # Use the same CPU count that the arrow library is set to
+    DBI::dbExecute(con, paste0("PRAGMA threads=", cpu_count()))
+    options(arrow_duck_con = con)
+  }
+  con
+}
+
+# TODO: note that this is copied from dbplyr

Review comment:
       I wouldn't say "copy" since this is clearly adapted, but do add a link 
to the inspiration



