[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

GitBox Tue, 15 Nov 2022 04:25:48 -0800


thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1022706108



##########
r/vignettes/data_wrangling.Rmd:
##########
@@ -0,0 +1,172 @@
+---
+title: "Data analysis with dplyr syntax"
+description: >
+  Learn how to use the `dplyr` backend supplied by `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides a `dplyr` back end that allows users to 
manipulate tabular Arrow data (`Table` and `Dataset` objects) using familiar 
`dplyr` syntax. To use this functionality, make sure that the `arrow` and 
`dplyr` packages are both loaded. In this article we will take the `starwars` 
data set included in `dplyr`, convert it to an Arrow Table, and then analyze 
this data. Note that, although these examples all use an in-memory `Table` 
object, the same functionality works for an on-disk `Dataset` object with only 
minor differences in behavior (documented later in the article).

Review Comment:
   Could we use a different term from "back end" here? I've heard people use 
"back end", "frontend", "API", and other terms for this kind of thing, so I 
think it can sound a bit ambiguous.



##########
r/vignettes/data_wrangling.Rmd:
##########
@@ -0,0 +1,172 @@
+---
+title: "Data analysis with dplyr syntax"
+description: >
+  Learn how to use the `dplyr` backend supplied by `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides a `dplyr` back end that allows users to 
manipulate tabular Arrow data (`Table` and `Dataset` objects) using familiar 
`dplyr` syntax. To use this functionality, make sure that the `arrow` and 
`dplyr` packages are both loaded. In this article we will take the `starwars` 
data set included in `dplyr`, convert it to an Arrow Table, and then analyze 
this data. Note that, although these examples all use an in-memory `Table` 
object, the same functionality works for an on-disk `Dataset` object with only 
minor differences in behavior (documented later in the article).
+
+To get started let's load the packages and create the data:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+sw <- arrow_table(starwars, as_data_frame = FALSE)
+```
+
+## One-table dplyr verbs
+
+The `arrow` package provides support for the `dplyr` one-table verbs, allowing 
users to construct data analysis pipelines in a familiar way. The example below 
shows the use of `filter()`, `rename()`, `mutate()`, `arrange()` and `select()`:
+
+```{r}
+result <- sw %>%
+  filter(homeworld == "Tatooine") %>%
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
+
+It is important to note that `arrow` users lazy evaluation to delay 
computation until the result is explicitly requested. This speeds up processing 
by enabling the Arrow C++ library to perform multiple computations in one 
operation. As a consequence of this design choice, we have not yet performed 
computations on the `sw` data have been performed. The `result` variable is an 
object with class `arrow_dplyr_query` that represents all the computations to 
be performed:

Review Comment:
   ```suggestion
   It is important to note that `arrow` uses lazy evaluation to delay 
computation until the result is explicitly requested. This speeds up processing 
by enabling the Arrow C++ library to perform multiple computations in one 
operation. As a consequence of this design choice, we have not yet performed 
computations on the `sw` data. The `result` variable is an object with class 
`arrow_dplyr_query` that represents all the computations to be performed:
   ```
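
   For context on the paragraph being edited, here is a minimal sketch (not 
part of the PR) of what the lazy evaluation looks like in practice. It assumes 
the `arrow` and `dplyr` packages are installed; the names `query` and 
`collected` are illustrative only. The pipeline produces an 
`arrow_dplyr_query`, and the computation only runs once `collect()` is called:

```r
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)

# Convert the starwars data frame to an in-memory Arrow Table
sw <- arrow_table(starwars)

# Build a query; no computation happens at this point
query <- sw %>%
  filter(homeworld == "Tatooine") %>%
  select(name, height)

# The object describes the pending computation rather than holding results
inherits(query, "arrow_dplyr_query")

# collect() triggers evaluation and returns an R data frame
collected <- collect(query)
```

Until `collect()` runs, `query` only records the steps to apply, which is what 
lets the Arrow C++ library combine them into a single operation.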



