[arrow] branch master updated: ARROW-16509: [R][Docs] Make corrections to datasets vignette

thisisnic Mon, 23 May 2022 14:19:48 -0700

This is an automated email from the ASF dual-hosted git repository.

thisisnic pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git



The following commit(s) were added to refs/heads/master by this push:
     new 119ead4346 ARROW-16509: [R][Docs] Make corrections to datasets vignette
119ead4346 is described below

commit 119ead4346d8749f793476a182f1cc55c7d6401b
Author: Will Jones <[email protected]>
AuthorDate: Mon May 23 22:19:34 2022 +0100

    ARROW-16509: [R][Docs] Make corrections to datasets vignette
    
    These changes correct a few inaccuracies in the datasets vignette. Enhancements to describe additional features will be left to [ARROW-12137](https://issues.apache.org/jira/browse/ARROW-12137).
    
    Closes #13178 from wjones127/ARROW-16509-dataset-vignette
    
    Authored-by: Will Jones <[email protected]>
    Signed-off-by: Nic Crane <[email protected]>
---
 r/vignettes/dataset.Rmd | 59 +++++++++++++++++++++++++++----------------------
 1 file changed, 33 insertions(+), 26 deletions(-)

diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd
index f09185589e..5c430c4be0 100644
--- a/r/vignettes/dataset.Rmd
+++ b/r/vignettes/dataset.Rmd
@@ -192,11 +192,18 @@ files, you've parsed file paths to identify partitions, and you've read the
 headers of the Parquet files to inspect their schemas so that you can make sure
 they all are as expected.
 
-In the current release, arrow supports the dplyr verbs `mutate()`, 
-`transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and 
-`arrange()`. Aggregation is not yet supported, so before you call `summarise()`
-or other verbs with aggregate functions, use `collect()` to pull the selected
-subset of the data into an in-memory R data frame.
+In the current release, arrow supports the dplyr verbs:
+
+ * `mutate()` and `transmute()`,
+ * `select()`, `rename()`, and `relocate()`,
+ * `filter()`,
+ * `arrange()`,
+ * `union()` and `union_all()`,
+ * `left_join()`, `right_join()`, `full_join()`, `inner_join()`, and `anti_join()`,
+ * `group_by()` and `summarise()`.
+
+At any point in a chain, you can use `collect()` to pull the selected subset of
+the data into an in-memory R data frame. 
 
 Suppose you attempt to call unsupported dplyr verbs or unimplemented functions
 in your query on an Arrow Dataset. In that case, the arrow package raises an error. However,
@@ -213,11 +220,11 @@ system.time(ds %>%
   select(tip_amount, total_amount, passenger_count) %>%
   mutate(tip_pct = 100 * tip_amount / total_amount) %>%
   group_by(passenger_count) %>%
-  collect() %>%
   summarise(
     median_tip_pct = median(tip_pct),
     n = n()
   ) %>%
+  collect() %>%
   print())
 ```
 
@@ -226,16 +233,16 @@ cat("
 # A tibble: 10 x 3
    passenger_count median_tip_pct      n
              <int>          <dbl>  <int>
- 1               0           9.84    380
- 2               1          16.7  143087
- 3               2          16.6   34418
- 4               3          14.4    8922
- 5               4          11.4    4771
- 6               5          16.7    5806
- 7               6          16.7    3338
- 8               7          16.7      11
- 9               8          16.7      32
-10               9          16.7      42
+ 1               1           16.6 143087
+ 2               2           16.2  34418
+ 3               5           16.7   5806
+ 4               4           11.4   4771
+ 5               6           16.7   3338
+ 6               3           14.6   8922
+ 7               0           10.1    380
+ 8               8           16.7     32
+ 9               9           16.7     42
+10               7           16.7     11
 
    user  system elapsed
   4.436   1.012   1.402
@@ -243,31 +250,31 @@ cat("
 ```
 
 You've just selected a subset out of a dataset with around 2 billion rows, computed
-a new column, and aggregated it in under 2 seconds on a modern laptop. How does
+a new column, and aggregated it in a few seconds on a modern laptop. How does
 this work?
 
-First, `mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`,
-`group_by()`, and `arrange()` record their actions but don't evaluate on the
-data until you run `collect()`.
+First, the dplyr verbs on the dataset record their actions but don't evaluate on
+the data until you run `collect()`.
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds %>%
   filter(total_amount > 100, year == 2015) %>%
   select(tip_amount, total_amount, passenger_count) %>%
   mutate(tip_pct = 100 * tip_amount / total_amount) %>%
-  group_by(passenger_count)
+  group_by(passenger_count) %>%
+  summarise(
+    median_tip_pct = median(tip_pct),
+    n = n()
+  )
 ```
 
 ```{r, echo = FALSE, eval = !file.exists("nyc-taxi")}
 cat("
 FileSystemDataset (query)
-tip_amount: float
-total_amount: float
 passenger_count: int8
-tip_pct: expr
+median_tip_pct: double
+n: int32
 
-* Filter: ((total_amount > 100) and (year == 2015))
-* Grouped by passenger_count
 See $.data for the source Arrow object
 ")
 ```
