This is an automated email from the ASF dual-hosted git repository.

thisisnic pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git


The following commit(s) were added to refs/heads/main by this push:
     new 2229a93  ARROW-13732: [Doc][Cookbook] Manipulating and analyze Arrow data with dplyr verbs - R (#78)
2229a93 is described below

commit 2229a93af99f642ce1fc0df691d07a6148a9f433
Author: Nic <[email protected]>
AuthorDate: Thu Oct 21 09:37:44 2021 +0100

    ARROW-13732: [Doc][Cookbook] Manipulating and analyze Arrow data with dplyr verbs - R (#78)
    
    * Update chapter to follow problem/solution/discussion format
    
    * Split data manipulation chapter into tables/arrays and add initial content
    
    * Remove assignment and have simpler chains
    
    * Shorten line
    
    * Add content on using compute functions not implemented in the R package
    
    * Remove the word tidyverse as it's inaccurate
    
    * Add heading
    
    * Entirely refactor
    
    * Add comment with current missing content
    
    * Actually use Arrow
    
    * Add "what you should know" section and do loads of rephrasing and adding examples
    
    * Add "what you should know before, and content on calling functions directly
    
    * Add test
    
    * Rename files
    
    * Add tests to code chunks
    
    * Add note about collect/create
    
    * Fix bad link
    
    * Update r/content/arrays.Rmd
    
    Co-authored-by: Weston Pace <[email protected]>
    
    * Update r/content/arrays.Rmd
    
    Co-authored-by: Weston Pace <[email protected]>
    
    * Update r/content/arrays.Rmd
    
    Co-authored-by: Weston Pace <[email protected]>
    
    * Remove "where possible"
    
    * Rephrase to clarify
    
    * restyle
    
    * More rephrasing
    
    * Add some simpler intro content to the dplyr chapter
    
    * Add tests for intro code
    
    * Put dataset in Table$create() call
    
    * Reduce whitespace
    
    * Use arrow_table instead of Table$create
    
    * Use Table$create for the moment while the next version isn't on CRAN yet
    
    * Erroneous renaming
    
    Co-authored-by: Weston Pace <[email protected]>
---
 r/content/_bookdown.yml                |   4 +-
 r/content/arrays.Rmd                   | 167 +++++++++++++++++++
 r/content/creating_arrow_objects.Rmd   |   2 +-
 r/content/index.Rmd                    |   8 +-
 r/content/manipulating_data.Rmd        |  75 ---------
 r/content/reading_and_writing_data.Rmd |   8 +-
 r/content/tables.Rmd                   | 290 +++++++++++++++++++++++++++++++++
 7 files changed, 472 insertions(+), 82 deletions(-)

diff --git a/r/content/_bookdown.yml b/r/content/_bookdown.yml
index 51b9ef2..03299b2 100644
--- a/r/content/_bookdown.yml
+++ b/r/content/_bookdown.yml
@@ -4,10 +4,12 @@ new_session: FALSE
 clean: ["_book/*"]
 output_dir: _book
 edit: https://github.com/apache/arrow-cookbook/edit/main/r/content/%s
+
 rmd_files: [
   "index.Rmd",
   "reading_and_writing_data.Rmd",
   "creating_arrow_objects.Rmd",
   "specify_data_types_and_schemas.Rmd",
-  "manipulating_data.Rmd"
+  "arrays.Rmd",
+  "tables.Rmd"
 ]
diff --git a/r/content/arrays.Rmd b/r/content/arrays.Rmd
new file mode 100644
index 0000000..2544cb1
--- /dev/null
+++ b/r/content/arrays.Rmd
@@ -0,0 +1,167 @@
+# Manipulating Data - Arrays
+
+__What you should know before you begin__
+
+An Arrow Array is roughly equivalent to an R vector - it can be used to 
+represent a single column of data, with all values having the same data type.  
+
+A number of base R functions which have S3 generic methods have been implemented
+to work on Arrow Arrays; for example `mean`, `min`, and `max`.  
+
+## Computing Mean/Min/Max, etc value of an Array
+
+You want to calculate the mean, minimum, or maximum of values in an array.
+
+### Solution
+
+```{r, array_mean_na}
+my_values <- Array$create(c(1:5, NA))
+mean(my_values, na.rm = TRUE)
+```
+```{r, test_array_mean_na, opts.label = "test"}
+test_that("array_mean_na works as expected", {
+  expect_equal(mean(my_values, na.rm = TRUE), Scalar$create(3))
+})
+```
+
+### Discussion
+
+Many base R generic functions such as `mean()`, `min()`, and `max()` have been
+mapped to their Arrow equivalents, and so can be called on Arrow Array objects 
+in the same way. They will return Arrow objects themselves.
+
+If you want to use an R function which does not have an Arrow mapping, you can 
+use `as.vector()` to convert Arrow objects to base R vectors.
+
+```{r, array_fivenum}
+arrow_array <- Array$create(1:100)
+# get Tukey's five-number summary
+fivenum(as.vector(arrow_array))
+```
+```{r, test_array_fivenum, opts.label = "test"}
+
+test_that("array_fivenum works as expected", {
+  
+  # generates both an error and a warning
+  expect_warning(
+    expect_error(fivenum(arrow_array))  
+  )
+  
+  expect_identical(
+    fivenum(as.vector(arrow_array)),
+    c(1, 25.5, 50.5, 75.5, 100)
+  )
+})
+
+```
+
+You can tell if a function is a standard S3 generic function by looking 
+at the body of the function - S3 generic functions call `UseMethod()`
+to determine the appropriate version of that function to use for the object.
+
+```{r}
+mean
+```
+
+You can also use `isS3stdGeneric()` to determine if a function is an S3 generic.
+
+```{r}
+isS3stdGeneric("mean")
+```
+
+If you find an S3 generic function which isn't implemented for Arrow objects 
+but you would like to be able to use, please 
+[open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
+
+## Counting occurrences of elements in an Array
+
+You want to count repeated values in an Array.
+
+### Solution
+
+```{r, value_counts}
+repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3))
+value_counts(repeated_vals)
+```
+
+```{r, test_value_counts, opts.label = "test"}
+test_that("value_counts works as expected", {
+  expect_equal(
+    as.vector(value_counts(repeated_vals)),
+    tibble(
+      values = as.numeric(names(table(as.vector(repeated_vals)))),
+      counts = as.vector(table(as.vector(repeated_vals)))
+    )
+  )
+})
+```
+
+### Discussion
+
+Some functions in the Arrow R package do not have base R equivalents. In other 
+cases, the base R equivalents are not generic functions so they cannot be called
+directly on Arrow Array objects.
+
+For example, the `value_counts()` function in the Arrow R package is loosely 
+equivalent to the base R function `table()`, which is not a generic function. 
+
+## Applying arithmetic functions to Arrays.
+
+You want to use the various arithmetic operators on Array objects.
+
+### Solution
+
+```{r, add_array}
+num_array <- Array$create(1:10)
+num_array + 10
+```
+```{r, test_add_array, opts.label = "test"}
+test_that("add_array works as expected", {
+  # need to specify expected array as 1:10 + 10 instead of 11:20 so is double not integer
+  expect_equal(num_array + 10, Array$create(1:10 + 10))
+})
+```
+
+### Discussion
+
+You will get the same result if you pass in the value you're adding as an Arrow object.
+
+```{r, add_array_scalar}
+num_array + Scalar$create(10)
+```
+```{r, test_add_array_scalar, opts.label = "test"}
+test_that("add_array_scalar works as expected", {
+  # need to specify expected array as 1:10 + 10 instead of 11:20 so is double not integer
+  expect_equal(num_array + Scalar$create(10), Array$create(1:10 + 10))
+})
+```
+
+## Calling Arrow compute functions directly on Arrays
+
+You want to call an Arrow compute function directly on an Array.
+
+### Solution
+
+```{r, call_function}
+first_100_numbers <- Array$create(1:100)
+
+# Calculate the variance of 1 to 100, setting the delta degrees of freedom to 0.
+call_function("variance", first_100_numbers, options = list(ddof = 0))
+
+```
+```{r, test_call_function, opts.label = "test"}
+test_that("call_function works as expected", {
+  expect_equal(
+    call_function("variance", first_100_numbers, options = list(ddof = 0)),
+    Scalar$create(833.25)
+  )
+})
+```
+### Discussion
+
+You can use `call_function()` to call Arrow compute functions directly on 
+Scalar, Array, and ChunkedArray objects.  The returned object will be an Arrow object.
+
+### See also
+
+For a more in-depth discussion of Arrow compute functions, see the section on [using arrow functions in dplyr verbs in arrow](#Using-arrow-functions-in-dplyr-verbs-in-arrow)
diff --git a/r/content/creating_arrow_objects.Rmd b/r/content/creating_arrow_objects.Rmd
index 64d3780..37404af 100644
--- a/r/content/creating_arrow_objects.Rmd
+++ b/r/content/creating_arrow_objects.Rmd
@@ -1,4 +1,4 @@
-# Creating Arrow Objects
+# Creating Arrow Objects {#creating-arrow-objects}
 
 ## Create an Arrow Array from an R object
 
diff --git a/r/content/index.Rmd b/r/content/index.Rmd
index 66faf58..a3f60cc 100644
--- a/r/content/index.Rmd
+++ b/r/content/index.Rmd
@@ -18,4 +18,10 @@ knitr::opts_template$set(test = list(
 
 # Preface
 
-This cookbook aims to provide a number of recipes showing how to perform common tasks using `arrow`.
+This cookbook aims to provide a number of recipes showing how to perform common
+tasks using `arrow`.
+
+## Alternative resources
+
+For a complete reference guide to the functions in `arrow`, as well as vignettes,
+see [the pkgdown site](https://arrow.apache.org/docs/r/).
diff --git a/r/content/manipulating_data.Rmd b/r/content/manipulating_data.Rmd
deleted file mode 100644
index fa9e440..0000000
--- a/r/content/manipulating_data.Rmd
+++ /dev/null
@@ -1,75 +0,0 @@
-# Manipulating Data
-
-## Computing Mean/Min/Max, etc value of an Array
-
-Many base R generic functions such as `mean()`, `min()`, and `max()` have been mapped to their Arrow equivalents, and so can be called on Arrow Array objects in the same way. They will return Arrow objects themselves.
-
-```{r, array_mean_na}
-my_values <- Array$create(c(1:5, NA))
-mean(my_values, na.rm = TRUE)
-```
-```{r, test_array_mean_na, opts.label = "test"}
-test_that("array_mean_na works as expected", {
-  expect_equal(mean(my_values, na.rm = TRUE), Scalar$create(3))
-})
-```
-If you want to use an R function which does not have an Arrow mapping, you can use `as.vector()` to convert Arrow objects to base R vectors.
-
-```{r, fivenum}
-fivenum(as.vector(my_values))
-```
-```{r, test_fivenum, opts.label = "test"}
-test_that("fivenum works as expected", {
-  expect_equal(fivenum(as.vector(my_values)), 1:5)
-})
-```
-
-## Counting occurrences of elements in an Array
-
-Some functions in the Arrow R package do not have base R equivalents. In other cases, the base R equivalents are not generic functions so they cannot be called directly on Arrow Array objects.
-
-For example, the `value_count()` function in the Arrow R package is loosely equivalent to the base R function `table()`, which is not a generic function. To count the elements in an R vector, you can use `table()`; to count the elements in an Arrow Array, you can use `value_count()`.
-
-```{r, value_counts}
-repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3))
-value_counts(repeated_vals)
-```
-
-```{r, test_value_counts, opts.label = "test"}
-test_that("value_counts works as expected", {
-  expect_equal(
-    as.vector(value_counts(repeated_vals)),
-    tibble(
-      values = as.numeric(names(table(as.vector(repeated_vals)))),
-      counts = as.vector(table(as.vector(repeated_vals)))
-    )
-  )
-})
-```
-
-## Applying arithmetic functions to Arrays.
-
-You can use the various arithmetic operators on Array objects.
-
-```{r, add_array}
-num_array <- Array$create(1:10)
-num_array + 10
-```
-```{r, test_add_array, opts.label = "test"}
-test_that("add_array works as expected", {
-  # need to specify expected array as 1:10 + 10 instead of 11:20 so is double not integer
-  expect_equal(num_array + 10, Array$create(1:10 + 10))
-})
-```
-
-You will get the same result if you pass in the value you're adding as an Arrow object.
-
-```{r, add_array_scalar}
-num_array + Scalar$create(10)
-```
-```{r, test_add_array_scalar, opts.label = "test"}
-test_that("add_array_scalar works as expected", {
-  # need to specify expected array as 1:10 + 10 instead of 11:20 so is double not integer
-  expect_equal(num_array + Scalar$create(10), Array$create(1:10 + 10))
-})
-```
diff --git a/r/content/reading_and_writing_data.Rmd b/r/content/reading_and_writing_data.Rmd
index 47274ad..8cd2e36 100644
--- a/r/content/reading_and_writing_data.Rmd
+++ b/r/content/reading_and_writing_data.Rmd
@@ -94,16 +94,16 @@ test_that("read_parquet_2 works as expected", {
 If you set `as_data_frame` to `FALSE`, the file will be read in as an Arrow Table.
 
 ```{r, read_parquet_table}
-my_table_arrow_table <- read_parquet("my_table.parquet", as_data_frame = FALSE)
-my_table_arrow_table
+my_table_arrow <- read_parquet("my_table.parquet", as_data_frame = FALSE)
+my_table_arrow
 ```
 
 ```{r, read_parquet_table_class}
-class(my_table_arrow_table)
+class(my_table_arrow)
 ```
 ```{r, test_read_parquet_table_class, opts.label = "test"}
 test_that("read_parquet_table_class works as expected", {
-  expect_s3_class(my_table_arrow_table, "Table")
+  expect_s3_class(my_table_arrow, "Table")
 })
 ```
 
diff --git a/r/content/tables.Rmd b/r/content/tables.Rmd
new file mode 100644
index 0000000..1c935d7
--- /dev/null
+++ b/r/content/tables.Rmd
@@ -0,0 +1,290 @@
+# Manipulating Data - Tables
+
+## Introduction
+
+One of the aims of the Arrow project is to reduce duplication between different
+data frame implementations.  The underlying implementation of a data frame is a
+conceptually different thing to the code that you run to work with it - the API.
+
+You may have seen this before in packages like `dbplyr` which allow you to use 
+the dplyr API to interact with SQL databases.
+
+The `arrow` package has been written so that the underlying Arrow table-like 
+objects can be manipulated via use of the dplyr API via the dplyr verbs.
+
+For example, here's a short pipeline of data manipulation which uses dplyr exclusively:
+  
+```{r, dplyr_raw}
+library(dplyr)
+starwars %>%
+  filter(species == "Human") %>%
+  mutate(height_ft = height/30.48) %>%
+  select(name, height_ft)
+```
+
+And the same results as using arrow with dplyr syntax:
+  
+```{r, dplyr_arrow}
+Table$create(starwars) %>%
+  filter(species == "Human") %>%
+  mutate(height_ft = height/30.48) %>%
+  select(name, height_ft) %>%
+  collect()
+```
+
+```{r, test_dplyr_raw_and_arrow, opts.label = "test"}
+
+test_that("dplyr_raw and dplyr_arrow chunk provide the same results", {
+  
+  expect_equal(
+    starwars %>%
+      filter(species == "Human") %>%
+      mutate(height_ft = height/30.48) %>%
+      select(name, height_ft),
+    Table$create(starwars) %>%
+      filter(species == "Human") %>%
+      mutate(height_ft = height/30.48) %>%
+      select(name, height_ft) %>%
+      collect()
+  )
+  
+})
+
+```
+
+
+You'll notice we've used `collect()` in the Arrow pipeline above.  That's because
+one of the ways in which `arrow` is efficient is that it works out the instructions
+for the calculations it needs to perform (_expressions_) and only runs them once
+you actually pull the data into your R session.  This means instead of doing 
+lots of separate operations, it does them all at once in a more optimised way, 
+_lazy evaluation_.
+
+It also means that you are able to manipulate data that is larger than you can 
+fit into memory on the machine you're running your code on, if you only pull 
+data into R when you have selected the desired subset. 
+
+You can also have data which is split across multiple files.  For example, you
+might have files which are stored in multiple Parquet or Feather files, 
+partitioned across different directories.  You can open multi-file datasets 
+using `open_dataset()` as discussed in a previous chapter, and then manipulate 
+this data using arrow before even reading any of it into R.
+
+## Use dplyr verbs in arrow
+
+You want to use a dplyr verb in arrow.
+
+### Solution
+
+```{r, dplyr_verb}
+library(dplyr)
+Table$create(starwars) %>%
+  filter(species == "Human", homeworld == "Tatooine") %>%
+  collect()
+```
+
+```{r, test_dplyr_verb, opts.label = "test"}
+
+test_that("dplyr_verb works as expected", {
+  out <- Table$create(starwars) %>%
+    filter(species == "Human", homeworld == "Tatooine") %>%
+    collect()
+
+  expect_equal(nrow(out), 8)
+  expect_s3_class(out, "data.frame")
+  expect_identical(unique(out$species), "Human")
+  expect_identical(unique(out$homeworld), "Tatooine")
+})
+
+```
+
+### Discussion
+
+You can use most of the dplyr verbs directly from arrow.  
+
+### See also
+
+You can find examples of the various dplyr verbs in "Introduction to dplyr" - 
+run `vignette("dplyr", package = "dplyr")` or view on
+the [pkgdown site](https://dplyr.tidyverse.org/articles/dplyr.html).
+
+You can see more information about using `Table$create()` to create Arrow Tables
+and `collect()` to view them as R data frames in [Creating Arrow Objects](creating-arrow-objects.html#creating-arrow-objects).
+
+## Use R functions in dplyr verbs in arrow
+
+You want to use an R function inside a dplyr verb in arrow.
+
+### Solution
+
+```{r, dplyr_str_detect}
+Table$create(starwars) %>%
+  filter(str_detect(name, "Darth")) %>%
+  collect()
+```
+
+```{r, test_dplyr_str_detect, opts.label = "test"}
+
+test_that("dplyr_str_detect", {
+  out <- Table$create(starwars) %>%
+    filter(str_detect(name, "Darth")) %>%
+    collect()
+  
+  expect_equal(nrow(out), 2)
+  expect_equal(sort(out$name), c("Darth Maul", "Darth Vader"))
+  
+})
+
+```
+
+### Discussion
+
+The arrow package allows you to use dplyr verbs containing expressions which 
+include base R and many tidyverse functions, but call Arrow functions under the hood.
+If you find any base R or tidyverse functions which you would like to see a 
+mapping of in arrow, please 
+[open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
+
+If you try to call a function which does not have arrow mapping, the data will 
+be pulled back into R, and you will see a warning message.
+
+```{r, dplyr_func_warning}
+library(stringr)
+
+Table$create(starwars) %>%
+  mutate(name_split = str_split_fixed(name, " ", 2)) %>%
+  collect()
+```
+
+```{r, test_dplyr_func_warning, opts.label = "test"}
+
+test_that("dplyr_func_warning", {
+  
+  expect_warning(
+     Table$create(starwars) %>%
+      mutate(name_split = str_split_fixed(name, " ", 2)) %>%
+      collect(),
+    'Expression str_split_fixed(name, " ", 2) not supported in Arrow; pulling data into R',
+    fixed = TRUE
+  )
+
+})
+```
+## Use arrow functions in dplyr verbs in arrow
+
+You want to use a function which is implemented in Arrow's C++ library but either:
+* it doesn't have a mapping to a base R or tidyverse equivalent, or 
+* it has a mapping but nevertheless you want to call the C++ function directly
+
+### Solution
+
+```{r, dplyr_arrow_func}
+Table$create(starwars) %>%
+  select(name) %>%
+  mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
+  collect()
+```
+```{r, test_dplyr_arrow_func, opts.label = "test"}
+
+test_that("dplyr_arrow_func", {
+  out <- Table$create(starwars) %>%
+    select(name) %>%
+    mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
+    collect()
+  
+  expect_match(out$padded_name, "*****C-3PO", fixed = TRUE, all = FALSE)
+  
+})
+
+```
+### Discussion
+
+The vast majority of Arrow C++ compute functions have been mapped to their 
+base R or tidyverse equivalents, and we strongly recommend that you use 
+these mappings where possible, as the original functions are well documented
+and the mapped versions have been tested to ensure the results returned are as 
+expected.
+
+However, there may be circumstances in which you might want to use a compute 
+function from the Arrow C++ library which does not have a base R or tidyverse 
+equivalent.
+
+You can find documentation of Arrow C++ compute functions in 
+[the C++ documentation](https://arrow.apache.org/docs/cpp/compute.html#available-functions).
+This documentation lists all available compute functions, any associated options classes
+they need, and the valid data types that they can be used with.
+
+You can list all available Arrow compute functions from R by calling 
+`list_compute_functions()`.
+
+```{r, list_compute_funcs}
+list_compute_functions()
+```
+```{r, test_list_compute_funcs, opts.label = "test"}
+test_that("list_compute_funcs", {
+  expect_gt(length(list_compute_functions()), 0)
+})
+```
+
+The majority of functions here have been mapped to their base R or tidyverse 
+equivalent and can be called within a dplyr query as usual.  For functions which
+don't have a base R or tidyverse equivalent, or you want to supply custom 
+options, you can call them by prefixing their name with "arrow_".  
+
+For example, base R's `is.na()` function is the equivalent of the Arrow C++ 
+compute function `is_null()` with the option `nan_is_null` set to `TRUE`.  
+A mapping between these functions (with `nan_is_null` set to `TRUE`) has been
+created in arrow.
+
+```{r, dplyr_is_na}
+demo_df <- data.frame(x = c(1, 2, 3, NA, NaN))
+
+Table$create(demo_df) %>%
+  mutate(y = is.na(x)) %>% 
+  collect()
+```
+
+```{r, test_dplyr_is_na, opts.label = "test"}
+test_that("dplyr_is_na", {
+  out <- Table$create(demo_df) %>%
+    mutate(y = is.na(x)) %>% 
+    collect()
+  
+  expect_equal(out$y, c(FALSE, FALSE, FALSE, TRUE, TRUE))
+  
+})
+```
+
+If you want to call Arrow's `is_null()` function but with `nan_is_null` set to 
+`FALSE` (so it returns `TRUE` when a value being examined is `NA` but `FALSE` 
+when the value being examined is `NaN`), you must call `is_null()` directly and
+specify the option `nan_is_null = FALSE`.
+
+```{r, dplyr_arrow_is_null}
+Table$create(demo_df) %>%
+  mutate(y = arrow_is_null(x, options  = list(nan_is_null = FALSE))) %>% 
+  collect()
+```
+
+```{r, test_dplyr_arrow_is_null, opts.label = "test"}
+test_that("dplyr_arrow_is_null", {
+  out <- Table$create(demo_df) %>%
+    mutate(y = arrow_is_null(x, options  = list(nan_is_null = FALSE))) %>% 
+    collect()
+  
+  expect_equal(out$y, c(FALSE, FALSE, FALSE, TRUE, FALSE))
+  
+})
+```
+
+#### Compute functions with options
+
+Although not all Arrow C++ compute functions require options to be specified, 
+most do.  For these functions to work in R, they must be linked up 
+with the appropriate libarrow options C++ class via the R 
+package's C++ code.  At the time of writing, all compute functions available in
+the development version of the arrow R package had been associated with their options
+classes.  However, as the Arrow C++ library's functionality extends, compute 
+functions may be added which do not yet have an R binding.  If you find a C++ 
+compute function which you wish to use from the R package, please [open an issue
+on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
\ No newline at end of file
