nealrichardson commented on code in PR #12817:
URL: https://github.com/apache/arrow/pull/12817#discussion_r854498397
##########
r/R/array.R:
##########
@@ -217,6 +217,135 @@ Array$create <- function(x, type = NULL) {
Array$import_from_c <- ImportArray
+#' Convert an object to an Arrow Array
+#'
+#' The `as_arrow_array()` function is identical to `Array$create()` except
+#' that it is an S3 generic, which allows methods to be defined in other
+#' packages to convert objects to [Array]. `Array$create()` is slightly faster
+#' because it tries to convert in C++ before falling back on
+#' `as_arrow_array()`.
+#'
+#' @param x An object to convert to an Arrow Array
+#' @param ... Passed to S3 methods
+#' @param type A [type][data-type] for the final Array. A value of `NULL`
+#' will default to the type guessed by [infer_type()].
+#'
+#' @return An [Array] with type `type`.
+#' @export
+#'
+#' @examples
+#' as_arrow_array(1:5)
+#'
+as_arrow_array <- function(x, ..., type = NULL) {
+ UseMethod("as_arrow_array")
+}
+
+#' @export
+as_arrow_array.default <- function(x, ..., type = NULL, from_vec_to_array = FALSE) {
+ # If from_vec_to_array is TRUE, this is a call from C++ after
+ # trying the internal C++ conversion, and S3 dispatch has failed
+ # to find a method for the object. This call happens when creating
+ # Array, ChunkedArray, RecordBatch, and Table objects from data.frame
+ # if the internal C++ conversion (faster and can usually be parallelized)
+ # is not implemented. If the C++ call has reached this default method,
+ # we error. If from_vec_to_array is FALSE, we call vec_to_Array to use the
+ # internal C++ conversion.
+ if (from_vec_to_array) {
+ # Last ditch attempt: if vctrs::vec_is(x), we can use the vctrs
+ # extension type.
+ if (vctrs::vec_is(x)) {
+ if (is.null(type)) {
+ vctrs_extension_array(x)
+ } else if (inherits(type, "VctrsExtensionType")) {
+ array <- vctrs_extension_array(
+ x,
+ ptype = type$ptype(),
+ storage_type = type$storage_type()
+ )
+ return(array)
Review Comment:
The other cases don't `return()` so maybe this one doesn't need to either?
```suggestion
vctrs_extension_array(
x,
ptype = type$ptype(),
storage_type = type$storage_type()
)
```
##########
r/R/record-batch-reader.R:
##########
@@ -176,3 +176,62 @@ RecordBatchFileReader$create <- function(file) {
assert_is(file, "InputStream")
ipc___RecordBatchFileReader__Open(file)
}
+
+#' Convert an object to an Arrow RecordBatchReader
+#'
+#' @param x An object to convert to a [RecordBatchReader]
+#' @param ... Passed to S3 methods
+#'
+#' @return A [RecordBatchReader]
+#' @export
+#'
+#' @examplesIf arrow_with_dataset()
+#' reader <- as_record_batch_reader(data.frame(col1 = 1, col2 = "two"))
+#' reader$read_next_batch()
+#'
+as_record_batch_reader <- function(x, ...) {
+ UseMethod("as_record_batch_reader")
+}
+
+#' @rdname as_record_batch_reader
+#' @export
+as_record_batch_reader.RecordBatchReader <- function(x, ...) {
+ x
+}
+
+#' @rdname as_record_batch_reader
+#' @export
+as_record_batch_reader.Table <- function(x, ...) {
+ RecordBatchReader__from_Table(x)
+}
+
+#' @rdname as_record_batch_reader
+#' @export
+as_record_batch_reader.RecordBatch <- function(x, ...) {
+ as_record_batch_reader(as_arrow_table(x))
Review Comment:
You could go directly to the reader
https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L316
though this is probably fine
##########
r/R/record-batch.R:
##########
@@ -189,3 +189,56 @@ record_batch <- RecordBatch$create
#' @export
names.RecordBatch <- function(x) x$names()
+
+#' Convert an object to an Arrow RecordBatch
+#'
+#' Whereas [record_batch()] constructs a [RecordBatch] from one or more columns,
+#' `as_record_batch()` converts a single object to an Arrow [RecordBatch].
+#'
+#' @param x An object to convert to an Arrow RecordBatch
+#' @param ... Passed to S3 methods
+#' @inheritParams record_batch
+#'
+#' @return A [RecordBatch]
+#' @export
+#'
+#' @examplesIf arrow_available()
+#' as_record_batch(data.frame(col1 = 1, col2 = "two"))
+#'
+as_record_batch <- function(x, ..., schema = NULL) {
+ UseMethod("as_record_batch")
+}
+
+#' @rdname as_record_batch
+#' @export
+as_record_batch.RecordBatch <- function(x, ..., schema = NULL) {
+ if (is.null(schema)) {
+ x
+ } else {
+ x$cast(schema)
+ }
+}
+
+#' @rdname as_record_batch
+#' @export
+as_record_batch.Table <- function(x, ..., schema = NULL) {
+ if (x$num_columns == 0) {
+ batch <- record_batch(data.frame())
+ return(batch$Take(rep_len(0, x$num_rows)))
+ }
+
+ arrays_out <- lapply(x$columns, as_arrow_array)
Review Comment:
He made a jira: ARROW-16239
##########
r/src/arrow_types.h:
##########
@@ -100,6 +127,35 @@ std::shared_ptr<arrow::DataType> InferArrowType(SEXP x);
std::shared_ptr<arrow::Array> vec_to_arrow__reuse_memory(SEXP x);
bool can_reuse_memory(SEXP x, const std::shared_ptr<arrow::DataType>& type);
+// These are the types of objects whose conversion to Arrow Arrays is handled
+// entirely in C++. Other types of objects are converted using the
+// infer_type() S3 generic and the as_arrow_array() S3 generic.
+// For data.frame, we need to recurse because the internal conversion
+// can't accommodate calling into R. If the user specifies a target type
+// and that target type is an ExtensionType, we also can't convert
+// natively (but we check for this separately when it applies).
+static inline bool can_convert_native(SEXP x) {
+ if (!Rf_isObject(x)) {
Review Comment:
Just so I understand: Rf_isObject is true for S4 objects? So anything that
is not an S4 is fine to convert native?
##########
r/R/type.R:
##########
@@ -56,29 +56,64 @@ DataType$import_from_c <- ImportType
INTEGER_TYPES <- as.character(outer(c("uint", "int"), c(8, 16, 32, 64), paste0))
FLOAT_TYPES <- c("float16", "float32", "float64", "halffloat", "float", "double")
-#' infer the arrow Array type from an R vector
+#' Infer the arrow Array type from an R object
#'
-#' @param x an R vector
+#' @param x an R object (usually a vector) to be converted to an [Array] or
+#' [ChunkedArray].
+#' @param ... Passed to S3 methods
#'
-#' @return an arrow logical type
+#' @return An arrow [data type][data-type]
#' @examplesIf arrow_available()
-#' type(1:10)
-#' type(1L:10L)
-#' type(c(1, 1.5, 2))
-#' type(c("A", "B", "C"))
-#' type(mtcars)
-#' type(Sys.Date())
-#' @export
-type <- function(x) UseMethod("type")
+#' infer_type(1:10)
+#' infer_type(1L:10L)
+#' infer_type(c(1, 1.5, 2))
+#' infer_type(c("A", "B", "C"))
+#' infer_type(mtcars)
+#' infer_type(Sys.Date())
+#' infer_type(as.POSIXlt(Sys.Date()))
+#' infer_type(vctrs::new_vctr(1:5, class = "my_custom_vctr_class"))
+#' @export
+infer_type <- function(x, ...) UseMethod("infer_type")
+
+#' @rdname infer_type
+#' @export
+type <- function(x) {
+ .Deprecated("infer_type")
+ infer_type(x)
+}
#' @export
-type.default <- function(x) Array__infer_type(x)
+infer_type.default <- function(x, ..., from_array_infer_type = FALSE) {
+ # If from_array_infer_type is TRUE, this is a call from C++ after
+ # checking the internal C++ type inference and S3 dispatch has failed
+ # to find a method for the object. This call happens when
+ # creating Array, ChunkedArray, RecordBatch, and Table objects from
+ # data.frame. If the C++ call has reached this default method,
+ # we error. If from_array_infer_type is FALSE, we call Array__infer_type
+ # to use the internal C++ type inference.
+ if (from_array_infer_type) {
+ # Last ditch attempt: if vctrs::vec_is(x), we can use the vctrs
+ # extension type.
+ if (vctrs::vec_is(x)) {
+ return(vctrs_extension_type(vctrs::vec_ptype(x)))
+ }
+
+ abort(
+ sprintf(
+ "Can't infer Arrow data type from object inheriting from %s",
+ paste(class(x), collapse = " / ")
+ )
+ )
+ } else {
+ Array__infer_type(x)
+ }
+}
#' @export
-type.ArrowDatum <- function(x) x$type
+infer_type.ArrowDatum <- function(x, ...) x$type
#' @export
-type.Expression <- function(x) x$type()
+infer_type.Expression <- function(x, ...) x$type()
Review Comment:
Unfortunate that with ArrowDatum it's an active binding but in Expression
it's a function :/
##########
r/R/type.R:
##########
@@ -56,29 +56,64 @@ DataType$import_from_c <- ImportType
INTEGER_TYPES <- as.character(outer(c("uint", "int"), c(8, 16, 32, 64), paste0))
FLOAT_TYPES <- c("float16", "float32", "float64", "halffloat", "float", "double")
-#' infer the arrow Array type from an R vector
+#' Infer the arrow Array type from an R object
#'
-#' @param x an R vector
+#' @param x an R object (usually a vector) to be converted to an [Array] or
+#' [ChunkedArray].
+#' @param ... Passed to S3 methods
#'
-#' @return an arrow logical type
+#' @return An arrow [data type][data-type]
#' @examplesIf arrow_available()
-#' type(1:10)
-#' type(1L:10L)
-#' type(c(1, 1.5, 2))
-#' type(c("A", "B", "C"))
-#' type(mtcars)
-#' type(Sys.Date())
-#' @export
-type <- function(x) UseMethod("type")
+#' infer_type(1:10)
+#' infer_type(1L:10L)
+#' infer_type(c(1, 1.5, 2))
+#' infer_type(c("A", "B", "C"))
+#' infer_type(mtcars)
+#' infer_type(Sys.Date())
+#' infer_type(as.POSIXlt(Sys.Date()))
+#' infer_type(vctrs::new_vctr(1:5, class = "my_custom_vctr_class"))
+#' @export
+infer_type <- function(x, ...) UseMethod("infer_type")
+
+#' @rdname infer_type
+#' @export
+type <- function(x) {
Review Comment:
Should we move this to deprecated.R? (Alternatively, if we don't think that
file is useful, we should just delete it and its contents; they've been in
deprecation for long enough.)
##########
r/R/array.R:
##########
@@ -217,6 +217,125 @@ Array$create <- function(x, type = NULL) {
Array$import_from_c <- ImportArray
+#' Convert an object to an Arrow Array
+#'
+#' Whereas `Array$create()` constructs an [Array] from the built-in data types
+#' for which the Arrow package implements fast converters, `as_arrow_array()`
+#' provides a means by which other packages can define conversions to Arrow
+#' objects.
+#'
+#' @param x An object to convert to an Arrow Array
+#' @param ... Passed to S3 methods
+#' @param type A [type][data-type] for the final Array. A value of `NULL`
+#' will default to the type guessed by [type()].
+#'
+#' @return An [Array].
+#' @export
+#'
+#' @examplesIf arrow_available()
+#' as_arrow_array(1:5)
+#'
+as_arrow_array <- function(x, ..., type = NULL) {
+ UseMethod("as_arrow_array")
+}
+
+#' @rdname as_arrow_array
+#' @export
+as_arrow_array.Array <- function(x, ..., type = NULL) {
+ if (is.null(type)) {
+ x
+ } else {
+ x$cast(type)
+ }
+}
+
+#' @rdname as_arrow_array
+#' @export
+as_arrow_array.ChunkedArray <- function(x, ..., type = NULL) {
+ concat_arrays(!!! x$chunks, type = type)
+}
+
+#' @rdname as_arrow_array
+#' @export
+as_arrow_array.vctrs_vctr <- function(x, ..., type = NULL) {
+ if (is.null(type)) {
+ vctrs_extension_array(x)
+ } else if (inherits(type, "VctrsExtensionType")) {
+ vctrs_extension_array(
+ x,
+ ptype = type$ptype(),
+ storage_type = type$storage_type()
+ )
+ } else {
+ stop_cant_convert_array(x, type)
+ }
+}
+
+#' @export
+as_arrow_array.POSIXlt <- function(x, ..., type = NULL) {
+ as_arrow_array.vctrs_vctr(x, ..., type = type)
+}
+
+#' @export
+as_arrow_array.data.frame <- function(x, ..., type = NULL) {
+ type <- type %||% infer_type(x)
+
+ if (inherits(type, "VctrsExtensionType")) {
+ storage <- as_arrow_array(x, type = type$storage_type())
+ new_extension_array(storage, type)
+ } else if (inherits(type, "StructType")) {
+ fields <- type$fields()
+ names <- map_chr(fields, "name")
+ types <- map(fields, "type")
+ arrays <- Map(as_arrow_array, x, types)
+ names(arrays) <- names
+
+ # ...because there isn't a StructArray$create() yet
Review Comment:
Ok. Since using the C interface here is clearly a hack, and apparently there
is a reason to add a StructArray$create method, could you please make (or find,
it could already exist) a jira and link the id here as a TODO?
##########
r/R/python.R:
##########
@@ -105,19 +105,25 @@ py_to_r.pyarrow.lib.ChunkedArray <- function(x, ...) {
}
r_to_py.Table <- function(x, convert = FALSE) {
- # Import with convert = FALSE so that `_import_from_c` returns a Python object
- pa <- reticulate::import("pyarrow", convert = FALSE)
- out <- pa$Table$from_arrays(x$columns, schema = x$schema)
- # But set the convert attribute on the return object to the requested value
+ # Going through RecordBatchReader maintains schema metadata (e.g.,
Review Comment:
This change is clever, though I have two concerns:
1. Why would we lose schema metadata the other way? We pass the schema
through the C interface. If it is losing metadata, that sounds like a bug to
fix.
2. I'm not sure how the Table to RecordBatchReader code works, but we should
make sure that it's not allocating/concatenating Arrays so that the chunks of
the ChunkedArrays line up into RecordBatches. If it did, that would be
suboptimal. I haven't read the source, but it would be worth a look.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]