eitsupi commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1007998182


##########
r/vignettes/data_wrangling.Rmd:
##########
@@ -0,0 +1,172 @@
+---
+title: "Data analysis with dplyr syntax"
+description: >
+  Learn how to use the `dplyr` backend supplied by `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides a `dplyr` back end that allows users to 
manipulate tabular Arrow data (`Table` and `Dataset` objects) using familiar 
`dplyr` syntax. To use this functionality, make sure that the `arrow` and 
`dplyr` packages are both loaded. In this article we will take the `starwars` 
data set included in `dplyr`, convert it to an Arrow Table, and then analyze 
this data. Note that, although these examples all use an in-memory `Table` 
object, the same functionality works for an on-disk `Dataset` object with only 
minor differences in behavior (documented later in the article).
+
+To get started let's load the packages and create the data:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+sw <- arrow_table(starwars, as_data_frame = FALSE)
+```
+
+## One-table dplyr verbs
+
+The `arrow` package provides support for the `dplyr` one-table verbs, allowing 
users to construct data analysis pipelines in a familiar way. The example below 
shows the use of `filter()`, `rename()`, `mutate()`, `arrange()` and `select()`:
+
+```{r}
+result <- sw %>%
+  filter(homeworld == "Tatooine") %>%
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
+
+It is important to note that `arrow` uses lazy evaluation to delay 
computation until the result is explicitly requested. This speeds up processing 
by enabling the Arrow C++ library to perform multiple computations in one 
operation. As a consequence of this design choice, no computations have yet 
been performed on the `sw` data. The `result` variable is an 
object with class `arrow_dplyr_query` that represents all the computations to 
be performed:
+
+```{r}
+result
+```
+
+To perform these computations and materialize the result, we call
+`compute()` or `collect()`. The two differ in the kind of object they 
return. Calling `compute()` returns an Arrow Table, 
suitable for passing to other `arrow` or `dplyr` functions:
+
+```{r}
+compute(result)
+```
+
+In contrast, `collect()` returns an R data frame, suitable for viewing or 
passing to other R functions for analysis or visualization:
+
+```{r}
+collect(result)
+```
+
+The `arrow` package has broad support for single-table `dplyr` verbs, 
including those that compute aggregates. For example, it supports `group_by()` 
and `summarize()`, as well as commonly-used convenience functions such as 
`count()`:
+
+```{r}
+sw %>%
+  group_by(species) %>%
+  summarize(mean_height = mean(height, na.rm = TRUE)) %>%
+  collect()
+
+sw %>% 
+  count(gender) %>%
+  collect()
+```
+
+Note, however, that window functions such as `ntile()` are not yet supported. 
+
+## Two-table dplyr verbs
+
+Equality joins (e.g. `left_join()`, `inner_join()`) are supported for joining 
multiple tables. This is illustrated below:
+
+```{r}
+jedi <- data.frame(
+  name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
+  jedi = c(FALSE, TRUE, TRUE)
+)
+
+sw %>%
+  select(1:3) %>%
+  right_join(jedi) %>%
+  collect()
+```
+
+## Expressions within dplyr verbs
+
+Inside `dplyr` verbs, Arrow offers support for many functions and operators, 
with common functions mapped to their base R and tidyverse equivalents. The 
[changelog](https://arrow.apache.org/docs/r/news/index.html) lists many of 
them. If there are additional functions you would like to see implemented, 
please file an issue as described in the [Getting 
help](https://arrow.apache.org/docs/r/#getting-help) guidelines.
+
+## Registering custom bindings
+
+The `arrow` package makes it possible for users to supply bindings for custom 
functions in some situations using `register_scalar_function()`. To operate 
correctly, the to-be-registered function must have `context` as its first 
argument, as required by the query engine. For example, suppose we wanted to 
implement a function that converts a string to snake case (a greatly simplified 
version of `janitor::make_clean_names()`). The function could be written as 
follows:
+
+```{r}
+to_snake_name <- function(context, string) {
+  replace <- c(`'` = "", `"` = "", `-` = "", `\\.` = "_", ` ` = "_")
+  string %>% 
+    stringr::str_replace_all(replace) %>%
+    stringr::str_to_lower() %>% 
+    stringi::stri_trans_general(id = "Latin-ASCII")
+}
+```
+
+To call this within an `arrow`/`dplyr` pipeline, it needs to be registered:
+
+```{r}
+register_scalar_function(
+  name = "to_snake_name",
+  fun = to_snake_name,
+  in_type = utf8(),
+  out_type = utf8(),
+  auto_convert = TRUE
+)
+```
+
+In this call, the `name` argument specifies the name by which the function will be 
recognized in the context of the `arrow`/`dplyr` pipeline and `fun` is the 
function itself. The `in_type` and `out_type` arguments specify the 
expected data types for the input and output, and `auto_convert` specifies 
whether `arrow` should automatically convert any R inputs to their Arrow 
equivalents. 
+
+Once registered, the following works:
+
+```{r}
+sw %>% 
+  transmute(name, snake_name = to_snake_name(name)) %>%

Review Comment:
   How about using `mutate(.keep = "none")` instead of `transmute()`?
   tidyverse/dplyr#6414
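
To make the suggestion concrete, here is a minimal sketch of the two forms side by side. It runs on a plain data frame with base R's `toupper()` rather than on the Arrow Table and the registered `to_snake_name()` from the vignette, so the `people` data and the `upper_name` column are hypothetical stand-ins, and it assumes a dplyr version that supports the `.keep` argument (1.0.0 or later).

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical stand-in data; the vignette itself uses an Arrow Table of `starwars`.
people <- data.frame(name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"))

# Form currently used in the diff: keep `name` and add one derived column.
via_transmute <- people %>%
  transmute(name, upper_name = toupper(name))

# Suggested form: mutate() that keeps only the columns named in the call.
via_mutate <- people %>%
  mutate(name, upper_name = toupper(name), .keep = "none")

# Both results should hold the same two columns with the same values.
stopifnot(isTRUE(all.equal(via_transmute, via_mutate)))
```

One caveat worth checking before swapping: in recent dplyr releases the two forms can order columns differently when a new column is listed before an existing one, because `mutate(.keep = "none")` leaves pre-existing columns in their original position. With `name` listed first, as here, the output should be the same.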



