eitsupi commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1007998182
##########
r/vignettes/data_wrangling.Rmd:
##########
@@ -0,0 +1,172 @@
+---
+title: "Data analysis with dplyr syntax"
+description: >
+ Learn how to use the `dplyr` backend supplied by `arrow`
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides a `dplyr` back end that allows users to
manipulate tabular Arrow data (`Table` and `Dataset` objects) using familiar
`dplyr` syntax. To use this functionality, make sure that the `arrow` and
`dplyr` packages are both loaded. In this article we will take the `starwars`
data set included in `dplyr`, convert it to an Arrow Table, and then analyze
this data. Note that, although these examples all use an in-memory `Table`
object, the same functionality works for an on-disk `Dataset` object with only
minor differences in behavior (documented later in the article).
+
+To get started let's load the packages and create the data:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+sw <- arrow_table(starwars, as_data_frame = FALSE)
+```
+
+## One-table dplyr verbs
+
+The `arrow` package provides support for the `dplyr` one-table verbs, allowing
users to construct data analysis pipelines in a familiar way. The example below
shows the use of `filter()`, `rename()`, `mutate()`, `arrange()` and `select()`:
+
+```{r}
+result <- sw %>%
+ filter(homeworld == "Tatooine") %>%
+ rename(height_cm = height, mass_kg = mass) %>%
+ mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+ arrange(desc(birth_year)) %>%
+ select(name, height_in, mass_lbs)
+```
+
+It is important to note that `arrow` uses lazy evaluation to delay
computation until the result is explicitly requested. This speeds up processing
by enabling the Arrow C++ library to perform multiple computations in one
operation. As a consequence of this design choice, no computations on the
`sw` data have been performed yet. The `result` variable is an
object with class `arrow_dplyr_query` that represents all the computations to
be performed:
+
+```{r}
+result
+```
+
+To perform these computations and materialize the result, we call
+`compute()` or `collect()`. The two functions differ in the kind of object
they return. Calling `compute()` returns an Arrow Table,
suitable for passing to other `arrow` or `dplyr` functions:
+
+```{r}
+compute(result)
+```
+
+In contrast, `collect()` returns an R data frame, suitable for viewing or
passing to other R functions for analysis or visualization:
+
+```{r}
+collect(result)
+```
+
+The `arrow` package has broad support for single-table `dplyr` verbs,
including those that compute aggregates. For example, it supports `group_by()`
and `summarize()`, as well as commonly-used convenience functions such as
`count()`:
+
+```{r}
+sw %>%
+ group_by(species) %>%
+ summarize(mean_height = mean(height, na.rm = TRUE)) %>%
+ collect()
+
+sw %>%
+ count(gender) %>%
+ collect()
+```
+
+Note, however, that window functions such as `ntile()` are not yet supported.
+
+## Two-table dplyr verbs
+
+Equality joins (e.g. `left_join()`, `inner_join()`) are supported for joining
multiple tables. This is illustrated below:
+
+```{r}
+jedi <- data.frame(
+ name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
+ jedi = c(FALSE, TRUE, TRUE)
+)
+
+sw %>%
+ select(1:3) %>%
+ right_join(jedi) %>%
+ collect()
+```
+
+## Expressions within dplyr verbs
+
+Inside `dplyr` verbs, Arrow offers support for many functions and operators,
with common functions mapped to their base R and tidyverse equivalents. The
[changelog](https://arrow.apache.org/docs/r/news/index.html) lists many of
them. If there are additional functions you would like to see implemented,
please file an issue as described in the [Getting
help](https://arrow.apache.org/docs/r/#getting-help) guidelines.
+
+## Registering custom bindings
+
+The `arrow` package makes it possible for users to supply bindings for custom
functions in some situations using `register_scalar_function()`. To operate
correctly, the to-be-registered function must have `context` as its first
argument, as required by the query engine. For example, suppose we wanted to
implement a function that converts a string to snake case (a greatly simplified
version of `janitor::make_clean_names()`). The function could be written as
follows:
+
+```{r}
+to_snake_name <- function(context, string) {
+ replace <- c(`'` = "", `"` = "", `-` = "", `\\.` = "_", ` ` = "_")
+ string %>%
+ stringr::str_replace_all(replace) %>%
+ stringr::str_to_lower() %>%
+ stringi::stri_trans_general(id = "Latin-ASCII")
+}
+```
+
+To call this within an `arrow`/`dplyr` pipeline, it needs to be registered:
+
+```{r}
+register_scalar_function(
+ name = "to_snake_name",
+ fun = to_snake_name,
+ in_type = utf8(),
+ out_type = utf8(),
+ auto_convert = TRUE
+)
+```
+
+In this expression, the `name` argument specifies the name by which it will be
recognized in the context of the `arrow`/`dplyr` pipeline and `fun` is the
function itself. The `in_type` and `out_type` arguments are used to specify the
expected data type for the input and output, and `auto_convert` specifies
whether `arrow` should automatically convert any R inputs to their Arrow
equivalents.
+
+Once registered, the following works:
+
+```{r}
+sw %>%
+ transmute(name, snake_name = to_snake_name(name)) %>%
Review Comment:
How about using `mutate(.keep = "none")` instead of `transmute()`?
tidyverse/dplyr#6414
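For illustration, here is a minimal sketch of the suggested replacement. It assumes a plain data frame and base R's `tolower()` as a stand-in for the registered `to_snake_name()` binding, so it shows the `dplyr` semantics only and has not been checked against an Arrow Table. The names `df`, `with_transmute` and `with_mutate` are hypothetical.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical stand-in data; the vignette uses the Arrow Table `sw` instead.
df <- data.frame(
  name = c("Luke Skywalker", "C-3PO"),
  height = c(172, 167)
)

# Form used in the vignette: transmute() keeps only the listed columns.
with_transmute <- df %>%
  transmute(name, lower_name = tolower(name))

# Suggested form: mutate() with .keep = "none" drops every column
# that is not created or mentioned in the call.
with_mutate <- df %>%
  mutate(name, lower_name = tolower(name), .keep = "none")

# Both results should contain the same two columns.
names(with_transmute)
names(with_mutate)
```

In the vignette itself the equivalent change would presumably be `mutate(name, snake_name = to_snake_name(name), .keep = "none")`.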