nealrichardson commented on a change in pull request #10056:
URL: https://github.com/apache/arrow/pull/10056#discussion_r616985185
##########
File path: r/R/arrow-tabular.R
##########
@@ -211,6 +211,25 @@ head.ArrowTabular <- head.ArrowDatum
#' @export
tail.ArrowTabular <- tail.ArrowDatum
+#' @export
+na.fail.ArrowTabular <- function(object, ...){
+ na_count <- sum(purrr::map_int(object$columns, ~.x$null_count))
+ if(na_count > 0){
+ stop("missing values in object")
+ }
+ object
+}
+
+#' @export
+na.omit.ArrowTabular <- function(object, ...){
+ not_na <- purrr::map(object$columns, ~!is.na(.x))
+ not_na_agg <- purrr::reduce(not_na, `&`)
+ object$Filter(not_na_agg)
Review comment:
A few thoughts:
1. It would be nice to build up a single array_expression rather than
eagerly evaluate each part. It doesn't change anything today, but soon it will
allow some optimizations in the Arrow C++ library, and it would help if we're
already set up for that.
2. Building the expression also lets you use "is_valid", which is more
efficient than `negate(is_null(x))`
3. This seems like the kind of thing that should ultimately be implemented
in the C++ library--it's useful and there are optimizations to be made by
dealing with this at a lower level. Can you please make a JIRA?
4. We import `purrr::map` so you don't need to `::` it.
```suggestion
not_na <- map(object$columns, ~build_array_expression("is_valid", .x))
not_na_agg <- Reduce("&", not_na)
object$Filter(eval_array_expression(not_na_agg))
```
##########
File path: r/R/arrow-tabular.R
##########
@@ -211,6 +211,25 @@ head.ArrowTabular <- head.ArrowDatum
#' @export
tail.ArrowTabular <- tail.ArrowDatum
+#' @export
+na.fail.ArrowTabular <- function(object, ...){
+ na_count <- sum(purrr::map_int(object$columns, ~.x$null_count))
+ if(na_count > 0){
+ stop("missing values in object")
+ }
Review comment:
There's a nonzero cost to instantiating R6 objects, so if `object` has
lots of columns, `object$columns` might do more work than we need, especially
if the first column has missing values. So I might suggest a `for` loop here:
```suggestion
for (col in seq_len(object$num_columns)) {
if (object$column(col - 1L)$null_count > 0) {
stop("missing values in object", call. = FALSE)
}
}
```
Even better would be to push this for loop into C++, but I don't think it's
necessary--I don't think this is a function that will get lots of use.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]