[arrow-cookbook] branch main updated: [R] Document window aggregates (#311)

thisisnic Wed, 21 Jun 2023 08:53:32 -0700

This is an automated email from the ASF dual-hosted git repository.

thisisnic pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git



The following commit(s) were added to refs/heads/main by this push:
     new 2b473ce  [R] Document window aggregates (#311)
2b473ce is described below

commit 2b473ce0391af2cdddcc4b00445fa3c51f47c27b
Author: David Greiss <[email protected]>
AuthorDate: Wed Jun 21 11:53:24 2023 -0400

    [R] Document window aggregates (#311)
---
 r/content/tables.Rmd             | 100 ++++++++++++++++++++++++++++++++++++++-
 r/scripts/install_dependencies.R |   2 +-
 2 files changed, 100 insertions(+), 2 deletions(-)

diff --git a/r/content/tables.Rmd b/r/content/tables.Rmd
index 75078c2..89225ab 100644
--- a/r/content/tables.Rmd
+++ b/r/content/tables.Rmd
@@ -317,4 +317,102 @@ the development version of the Arrow R package had been 
associated with their op
 classes.  However, as the Arrow C++ library's functionality extends, compute 
 functions may be added which do not yet have an R binding.  If you find a C++ 
 compute function which you wish to use from the R package, please [open an 
issue
-on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
\ No newline at end of file
+on the Github project](https://github.com/apache/arrow/issues).
+
+## Compute Window Aggregates
+
+You want to apply an aggregation (e.g. `mean()`) on a grouped table or within 
a rowwise operation like `filter()`:
+
+### Solution
+
+```{r, arrow_window_aggregate_joins}
+arrow_table(starwars) %>%
+  select(1:4) %>%
+  filter(!is.na(hair_color)) %>%
+  left_join(
+    arrow_table(starwars) %>%
+      group_by(hair_color) %>%
+      summarize(mean_height = mean(height, na.rm = TRUE))
+  ) %>%
+  filter(height < mean_height) %>%
+  select(!mean_height) %>%
+  collect()
+```
+
+```{r, test_arrow_window_aggregate_joins, opts.label = "test"}
+test_that("arrow_window_aggregate_joins", {
+  out <- arrow_table(starwars) %>%
+    select(1:4) %>%
+    filter(!is.na(hair_color)) %>%
+    left_join(
+      arrow_table(starwars) %>%
+        group_by(hair_color) %>%
+        summarize(mean_height = mean(height, na.rm = TRUE))
+    ) %>%
+    filter(height < mean_height) %>%
+    select(!mean_height) %>%
+    collect()
+
+  expected <- starwars %>%
+    select(name, height, hair_color) %>%
+    filter(!is.na(hair_color)) %>%
+    group_by(hair_color) %>%
+    filter(height < mean(height, na.rm = TRUE))
+
+  expect_equal(nrow(out), nrow(expected))
+  expect_equal(sum(out$height), sum(expected$height))
+})
+```
+
+Or using `to_duckdb()`
+```{r, arrow_window_aggregate_duckdb}
+arrow_table(starwars) %>%
+  select(1:4) %>%
+  filter(!is.na(hair_color)) %>%
+  to_duckdb() %>%
+  group_by(hair_color) %>%
+  filter(height < mean(height, na.rm = TRUE)) %>%
+  to_arrow() %>%
+  collect()
+```
+
+```{r, test_arrow_window_aggregate_duckdb, opts.label = "test"}
+test_that("arrow_window_aggregate_duckdb", {
+  out <- arrow_table(starwars) %>%
+    select(1:4) %>%
+    filter(!is.na(hair_color)) %>%
+    to_duckdb() %>%
+    group_by(hair_color) %>%
+    filter(height < mean(height, na.rm = TRUE)) %>%
+    to_arrow() %>%
+    collect()
+
+  expected <- starwars %>%
+    select(name, height, hair_color) %>%
+    filter(!is.na(hair_color)) %>%
+    group_by(hair_color) %>%
+    filter(height < mean(height, na.rm = TRUE))
+
+  expect_equal(nrow(out), nrow(expected))
+  expect_equal(sum(out$height), sum(expected$height))
+})
+```
+
+### Discusson
+
+Arrow does not support window functions, and pulls the data into R. For large 
tables, this sacrifices performance. 
+
+```{r, arrow_window_aggregate}
+arrow_table(starwars) %>%
+  select(1:4) %>%
+  filter(!is.na(hair_color)) %>%
+  group_by(hair_color) %>%
+  filter(height < mean(height, na.rm = TRUE))
+```
+
+You can perform these window aggregate operations on Arrow tables by:
+
+- Computing the aggregation separately, and joining the result
+- Passing the data to DuckDB, and use the DuckDB query engine to perform the 
operations
+
+Arrow supports zero-copy integration with DuckDB, and DuckDB can query Arrow 
datasets directly and stream query results back to Arrow. This integreation 
uses zero-copy streaming of data between DuckDB and Arrow and vice versa so 
that you can compose a query using both together, all the while not paying any 
cost to (re)serialize the data when you pass it back and forth. This is 
especially useful in cases where something is supported in one of Arrow or 
DuckDB query engines but not the othe [...]
\ No newline at end of file
diff --git a/r/scripts/install_dependencies.R b/r/scripts/install_dependencies.R
index 60eec6d..5daec6a 100644
--- a/r/scripts/install_dependencies.R
+++ b/r/scripts/install_dependencies.R
@@ -82,7 +82,7 @@ install_arrow_version <- function(version_to_install) {
 }
 
 dependencies <- c("testthat", "bookdown", "knitr", "purrr", "remotes", 
-                  "dplyr", "stringr", "reticulate")
+                  "dplyr", "stringr", "reticulate", "duckdb", "dbplyr")
 
 for (dependency in dependencies) {
   load_package(dependency)

[arrow-cookbook] branch main updated: [R] Document window aggregates (#311)

Reply via email to