This is an automated email from the ASF dual-hosted git repository.
thisisnic pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git
The following commit(s) were added to refs/heads/main by this push:
new 2229a93 ARROW-13732: [Doc][Cookbook] Manipulating and analyze Arrow
data with dplyr verbs - R (#78)
2229a93 is described below
commit 2229a93af99f642ce1fc0df691d07a6148a9f433
Author: Nic <[email protected]>
AuthorDate: Thu Oct 21 09:37:44 2021 +0100
ARROW-13732: [Doc][Cookbook] Manipulating and analyze Arrow data with dplyr
verbs - R (#78)
* Update chapter to follow problem/solution/discussion format
* Split data manipulation chapter into tables/arrays and add initial content
* Remove assignment and have simpler chains
* Shorten line
* Add content on using compute functions not implemented in the R package
* Remove the word tidyverse as it's inaccurate
* Add heading
* Entirely refactor
* Add comment with current missing content
* Actually use Arrow
* Add "what you should know" section and do loads of rephrasing and adding
examples
* Add "what you should know before, and content on calling functions
directly
* Add test
* Rename files
* Add tests to code chunks
* Add note about collect/create
* Fix bad link
* Update r/content/arrays.Rmd
Co-authored-by: Weston Pace <[email protected]>
* Update r/content/arrays.Rmd
Co-authored-by: Weston Pace <[email protected]>
* Update r/content/arrays.Rmd
Co-authored-by: Weston Pace <[email protected]>
* Remove "where possible"
* Rephrase to clarify
* restyle
* More rephrasing
* Add some simpler intro content to the dplyr chapter
* Add tests for intro code
* Put dataset in Table$create() call
* Reduce whitespace
* Use arrow_table instead of Table$create
* Use Table$create for the moment while the next version isn't on CRAN yet
* Erroneous renaming
Co-authored-by: Weston Pace <[email protected]>
---
r/content/_bookdown.yml | 4 +-
r/content/arrays.Rmd | 167 +++++++++++++++++++
r/content/creating_arrow_objects.Rmd | 2 +-
r/content/index.Rmd | 8 +-
r/content/manipulating_data.Rmd | 75 ---------
r/content/reading_and_writing_data.Rmd | 8 +-
r/content/tables.Rmd | 290 +++++++++++++++++++++++++++++++++
7 files changed, 472 insertions(+), 82 deletions(-)
diff --git a/r/content/_bookdown.yml b/r/content/_bookdown.yml
index 51b9ef2..03299b2 100644
--- a/r/content/_bookdown.yml
+++ b/r/content/_bookdown.yml
@@ -4,10 +4,12 @@ new_session: FALSE
clean: ["_book/*"]
output_dir: _book
edit: https://github.com/apache/arrow-cookbook/edit/main/r/content/%s
+
rmd_files: [
"index.Rmd",
"reading_and_writing_data.Rmd",
"creating_arrow_objects.Rmd",
"specify_data_types_and_schemas.Rmd",
- "manipulating_data.Rmd"
+ "arrays.Rmd",
+ "tables.Rmd"
]
diff --git a/r/content/arrays.Rmd b/r/content/arrays.Rmd
new file mode 100644
index 0000000..2544cb1
--- /dev/null
+++ b/r/content/arrays.Rmd
@@ -0,0 +1,167 @@
+# Manipulating Data - Arrays
+
+__What you should know before you begin__
+
+An Arrow Array is roughly equivalent to an R vector - it can be used to
+represent a single column of data, with all values having the same data type.
+
+A number of base R functions which have S3 generic methods have been implemented
+to work on Arrow Arrays; for example `mean`, `min`, and `max`.
+
+## Computing Mean/Min/Max, etc value of an Array
+
+You want to calculate the mean, minimum, or maximum of values in an array.
+
+### Solution
+
+```{r, array_mean_na}
+my_values <- Array$create(c(1:5, NA))
+mean(my_values, na.rm = TRUE)
+```
+```{r, test_array_mean_na, opts.label = "test"}
+test_that("array_mean_na works as expected", {
+ expect_equal(mean(my_values, na.rm = TRUE), Scalar$create(3))
+})
+```
+
+### Discussion
+
+Many base R generic functions such as `mean()`, `min()`, and `max()` have been
+mapped to their Arrow equivalents, and so can be called on Arrow Array objects
+in the same way. They will return Arrow objects themselves.
+
+If you want to use an R function which does not have an Arrow mapping, you can
+use `as.vector()` to convert Arrow objects to base R vectors.
+
+```{r, array_fivenum}
+arrow_array <- Array$create(1:100)
+# get Tukey's five-number summary
+fivenum(as.vector(arrow_array))
+```
+```{r, test_array_fivenum, opts.label = "test"}
+
+test_that("array_fivenum works as expected", {
+
+ # generates both an error and a warning
+ expect_warning(
+ expect_error(fivenum(arrow_array))
+ )
+
+ expect_identical(
+ fivenum(as.vector(arrow_array)),
+ c(1, 25.5, 50.5, 75.5, 100)
+ )
+})
+
+```
+
+You can tell if a function is a standard S3 generic function by looking
+at the body of the function - S3 generic functions call `UseMethod()`
+to determine the appropriate version of that function to use for the object.
+
+```{r}
+mean
+```
+
+You can also use `isS3stdGeneric()` to determine if a function is an S3 generic.
+
+```{r}
+isS3stdGeneric("mean")
+```
+
+If you find an S3 generic function which isn't implemented for Arrow objects
+but which you would like to be able to use, please
+[open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
+
+## Counting occurrences of elements in an Array
+
+You want to count repeated values in an Array.
+
+### Solution
+
+```{r, value_counts}
+repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3))
+value_counts(repeated_vals)
+```
+
+```{r, test_value_counts, opts.label = "test"}
+test_that("value_counts works as expected", {
+ expect_equal(
+ as.vector(value_counts(repeated_vals)),
+ tibble(
+ values = as.numeric(names(table(as.vector(repeated_vals)))),
+ counts = as.vector(table(as.vector(repeated_vals)))
+ )
+ )
+})
+```
+
+### Discussion
+
+Some functions in the Arrow R package do not have base R equivalents. In other
+cases, the base R equivalents are not generic functions so they cannot be called
+directly on Arrow Array objects.
+
+For example, the `value_counts()` function in the Arrow R package is loosely
+equivalent to the base R function `table()`, which is not a generic function.
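+
+As a quick check of the point above, the sketch below (assuming the `arrow`
+package is attached) confirms that `table()` is not a standard S3 generic, and
+shows that it can still be used once the Array has been converted to an R vector.
+
+```{r, table_not_generic}
+library(arrow)
+
+# table() does not call UseMethod(), so it cannot dispatch on Arrow objects
+isS3stdGeneric(table)
+
+repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3))
+
+# Convert to an R vector first, then count with base R
+table(as.vector(repeated_vals))
+```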
+
+## Applying arithmetic functions to Arrays.
+
+You want to use the various arithmetic operators on Array objects.
+
+### Solution
+
+```{r, add_array}
+num_array <- Array$create(1:10)
+num_array + 10
+```
+```{r, test_add_array, opts.label = "test"}
+test_that("add_array works as expected", {
+ # need to specify expected array as 1:10 + 10 instead of 11:20 so it is double, not integer
+ expect_equal(num_array + 10, Array$create(1:10 + 10))
+})
+```
+
+### Discussion
+
+You will get the same result if you pass in the value you're adding as an Arrow object.
+
+```{r, add_array_scalar}
+num_array + Scalar$create(10)
+```
+```{r, test_add_array_scalar, opts.label = "test"}
+test_that("add_array_scalar works as expected", {
+ # need to specify expected array as 1:10 + 10 instead of 11:20 so it is double, not integer
+ expect_equal(num_array + Scalar$create(10), Array$create(1:10 + 10))
+})
+```
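+
+The other arithmetic and comparison operators can be used in the same way. The
+chunk below is a minimal sketch, assuming the `arrow` package is attached; it
+recreates `num_array` so that it can be run on its own.
+
+```{r, array_other_operators}
+library(arrow)
+
+num_array <- Array$create(1:10)
+
+# Multiplication and division work element-wise, as with R vectors
+num_array * 2
+num_array / 2
+
+# Comparison operators return a boolean Array
+num_array > 5
+```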
+
+## Calling Arrow compute functions directly on Arrays
+
+You want to call an Arrow compute function directly on an Array.
+
+### Solution
+
+```{r, call_function}
+first_100_numbers <- Array$create(1:100)
+
+# Calculate the variance of 1 to 100, setting the delta degrees of freedom to 0.
+call_function("variance", first_100_numbers, options = list(ddof = 0))
+
+```
+```{r, test_call_function, opts.label = "test"}
+test_that("call_function works as expected", {
+ expect_equal(
+ call_function("variance", first_100_numbers, options = list(ddof = 0)),
+ Scalar$create(833.25)
+ )
+})
+```
+### Discussion
+
+You can use `call_function()` to call Arrow compute functions directly on
+Scalar, Array, and ChunkedArray objects. The returned object will be an Arrow object.
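+
+The `ddof` option controls the divisor used in the variance calculation. As an
+illustrative sketch (assuming the `arrow` package is attached), setting
+`ddof = 1` gives the sample variance, which can be compared with the value from
+base R's `var()` after converting the returned Scalar to an R value.
+
+```{r, call_function_ddof}
+library(arrow)
+
+first_100_numbers <- Array$create(1:100)
+
+# ddof = 1 gives the sample variance, the same quantity base R's var() computes
+sample_var <- call_function("variance", first_100_numbers, options = list(ddof = 1))
+
+# Compare against base R after converting the Scalar to an R value
+all.equal(as.vector(sample_var), var(1:100))
+```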
+
+### See also
+
+For a more in-depth discussion of Arrow compute functions, see the section on
+[using arrow functions in dplyr verbs in arrow](#use-arrow-functions-in-dplyr-verbs-in-arrow).
diff --git a/r/content/creating_arrow_objects.Rmd b/r/content/creating_arrow_objects.Rmd
index 64d3780..37404af 100644
--- a/r/content/creating_arrow_objects.Rmd
+++ b/r/content/creating_arrow_objects.Rmd
@@ -1,4 +1,4 @@
-# Creating Arrow Objects
+# Creating Arrow Objects {#creating-arrow-objects}
## Create an Arrow Array from an R object
diff --git a/r/content/index.Rmd b/r/content/index.Rmd
index 66faf58..a3f60cc 100644
--- a/r/content/index.Rmd
+++ b/r/content/index.Rmd
@@ -18,4 +18,10 @@ knitr::opts_template$set(test = list(
# Preface
-This cookbook aims to provide a number of recipes showing how to perform common tasks using `arrow`.
+This cookbook aims to provide a number of recipes showing how to perform common
+tasks using `arrow`.
+
+## Alternative resources
+
+For a complete reference guide to the functions in `arrow`, as well as vignettes,
+see [the pkgdown site](https://arrow.apache.org/docs/r/).
diff --git a/r/content/manipulating_data.Rmd b/r/content/manipulating_data.Rmd
deleted file mode 100644
index fa9e440..0000000
--- a/r/content/manipulating_data.Rmd
+++ /dev/null
@@ -1,75 +0,0 @@
-# Manipulating Data
-
-## Computing Mean/Min/Max, etc value of an Array
-
-Many base R generic functions such as `mean()`, `min()`, and `max()` have been mapped to their Arrow equivalents, and so can be called on Arrow Array objects in the same way. They will return Arrow objects themselves.
-
-```{r, array_mean_na}
-my_values <- Array$create(c(1:5, NA))
-mean(my_values, na.rm = TRUE)
-```
-```{r, test_array_mean_na, opts.label = "test"}
-test_that("array_mean_na works as expected", {
- expect_equal(mean(my_values, na.rm = TRUE), Scalar$create(3))
-})
-```
-If you want to use an R function which does not have an Arrow mapping, you can use `as.vector()` to convert Arrow objects to base R vectors.
-
-```{r, fivenum}
-fivenum(as.vector(my_values))
-```
-```{r, test_fivenum, opts.label = "test"}
-test_that("fivenum works as expected", {
- expect_equal(fivenum(as.vector(my_values)), 1:5)
-})
-```
-
-## Counting occurrences of elements in an Array
-
-Some functions in the Arrow R package do not have base R equivalents. In other cases, the base R equivalents are not generic functions so they cannot be called directly on Arrow Array objects.
-
-For example, the `value_count()` function in the Arrow R package is loosely equivalent to the base R function `table()`, which is not a generic function. To count the elements in an R vector, you can use `table()`; to count the elements in an Arrow Array, you can use `value_count()`.
-
-```{r, value_counts}
-repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3))
-value_counts(repeated_vals)
-```
-
-```{r, test_value_counts, opts.label = "test"}
-test_that("value_counts works as expected", {
- expect_equal(
- as.vector(value_counts(repeated_vals)),
- tibble(
- values = as.numeric(names(table(as.vector(repeated_vals)))),
- counts = as.vector(table(as.vector(repeated_vals)))
- )
- )
-})
-```
-
-## Applying arithmetic functions to Arrays.
-
-You can use the various arithmetic operators on Array objects.
-
-```{r, add_array}
-num_array <- Array$create(1:10)
-num_array + 10
-```
-```{r, test_add_array, opts.label = "test"}
-test_that("add_array works as expected", {
- # need to specify expected array as 1:10 + 10 instead of 11:20 so is double not integer
- expect_equal(num_array + 10, Array$create(1:10 + 10))
-})
-```
-
-You will get the same result if you pass in the value you're adding as an Arrow object.
-
-```{r, add_array_scalar}
-num_array + Scalar$create(10)
-```
-```{r, test_add_array_scalar, opts.label = "test"}
-test_that("add_array_scalar works as expected", {
- # need to specify expected array as 1:10 + 10 instead of 11:20 so is double not integer
- expect_equal(num_array + Scalar$create(10), Array$create(1:10 + 10))
-})
-```
diff --git a/r/content/reading_and_writing_data.Rmd b/r/content/reading_and_writing_data.Rmd
index 47274ad..8cd2e36 100644
--- a/r/content/reading_and_writing_data.Rmd
+++ b/r/content/reading_and_writing_data.Rmd
@@ -94,16 +94,16 @@ test_that("read_parquet_2 works as expected", {
If you set `as_data_frame` to `FALSE`, the file will be read in as an Arrow Table.
```{r, read_parquet_table}
-my_table_arrow_table <- read_parquet("my_table.parquet", as_data_frame = FALSE)
-my_table_arrow_table
+my_table_arrow <- read_parquet("my_table.parquet", as_data_frame = FALSE)
+my_table_arrow
```
```{r, read_parquet_table_class}
-class(my_table_arrow_table)
+class(my_table_arrow)
```
```{r, test_read_parquet_table_class, opts.label = "test"}
test_that("read_parquet_table_class works as expected", {
- expect_s3_class(my_table_arrow_table, "Table")
+ expect_s3_class(my_table_arrow, "Table")
})
```
diff --git a/r/content/tables.Rmd b/r/content/tables.Rmd
new file mode 100644
index 0000000..1c935d7
--- /dev/null
+++ b/r/content/tables.Rmd
@@ -0,0 +1,290 @@
+# Manipulating Data - Tables
+
+## Introduction
+
+One of the aims of the Arrow project is to reduce duplication between different
+data frame implementations. The underlying implementation of a data frame is
+conceptually distinct from the code that you run to work with it - the API.
+
+You may have seen this before in packages like `dbplyr` which allow you to use
+the dplyr API to interact with SQL databases.
+
+The `arrow` package has been written so that the underlying Arrow table-like
+objects can be manipulated through the dplyr API, using the dplyr verbs.
+
+For example, here's a short pipeline of data manipulation which uses dplyr exclusively:
+
+```{r, dplyr_raw}
+library(dplyr)
+starwars %>%
+ filter(species == "Human") %>%
+ mutate(height_ft = height/30.48) %>%
+ select(name, height_ft)
+```
+
+And here is the same pipeline, giving the same results, using arrow with dplyr syntax:
+
+```{r, dplyr_arrow}
+Table$create(starwars) %>%
+ filter(species == "Human") %>%
+ mutate(height_ft = height/30.48) %>%
+ select(name, height_ft) %>%
+ collect()
+```
+
+```{r, test_dplyr_raw_and_arrow, opts.label = "test"}
+
+test_that("dplyr_raw and dplyr_arrow chunk provide the same results", {
+
+ expect_equal(
+ starwars %>%
+ filter(species == "Human") %>%
+ mutate(height_ft = height/30.48) %>%
+ select(name, height_ft),
+ Table$create(starwars) %>%
+ filter(species == "Human") %>%
+ mutate(height_ft = height/30.48) %>%
+ select(name, height_ft) %>%
+ collect()
+ )
+
+})
+
+```
+
+
+You'll notice we've used `collect()` in the Arrow pipeline above. That's because
+one of the ways in which `arrow` is efficient is that it works out the instructions
+for the calculations it needs to perform (_expressions_) and only runs them once
+you actually pull the data into your R session. This means that instead of doing
+lots of separate operations, it does them all at once in a more optimised way.
+This is known as _lazy evaluation_.
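+
+To see this behaviour in action, the sketch below (assuming `arrow` and `dplyr`
+are attached) builds a pipeline without calling `collect()` and inspects what
+comes back. The exact class name of the intermediate object is an implementation
+detail of the arrow package and may differ between versions.
+
+```{r, dplyr_lazy_query}
+library(arrow)
+library(dplyr)
+
+# Building the pipeline only records what should be done
+query <- Table$create(starwars) %>%
+  filter(species == "Human") %>%
+  select(name, height)
+
+# Still a query object, not a data frame
+class(query)
+
+# collect() runs the query and returns the result as an R data frame
+result <- collect(query)
+class(result)
+```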
+
+It also means that you are able to manipulate data that is larger than you can
+fit into memory on the machine you're running your code on, if you only pull
+data into R when you have selected the desired subset.
+
+You can also have data which is split across multiple files. For example, you
+might have data which is stored in multiple Parquet or Feather files,
+partitioned across different directories. You can open multi-file datasets
+using `open_dataset()` as discussed in a previous chapter, and then manipulate
+this data using arrow before even reading any of it into R.
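+
+As a rough illustration of this workflow, the chunk below writes the built-in
+`mtcars` data to a temporary directory as a partitioned dataset, then queries it
+without first loading it all into R. The directory name `mtcars_by_cyl` and the
+choice of `cyl` as the partitioning column are assumptions made for this example.
+
+```{r, open_dataset_sketch}
+library(arrow)
+library(dplyr)
+
+# Write mtcars to a temporary directory, one sub-directory per value of cyl
+dataset_dir <- file.path(tempdir(), "mtcars_by_cyl")
+write_dataset(mtcars, dataset_dir, partitioning = "cyl")
+
+# Open the files as a single Dataset; no data is read into R yet
+ds <- open_dataset(dataset_dir)
+
+# Only the rows and columns that survive the query are pulled into R
+four_cyl <- ds %>%
+  filter(cyl == 4) %>%
+  select(mpg, cyl) %>%
+  collect()
+```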
+
+## Use dplyr verbs in arrow
+
+You want to use a dplyr verb in arrow.
+
+### Solution
+
+```{r, dplyr_verb}
+library(dplyr)
+Table$create(starwars) %>%
+ filter(species == "Human", homeworld == "Tatooine") %>%
+ collect()
+```
+
+```{r, test_dplyr_verb, opts.label = "test"}
+
+test_that("dplyr_verb works as expected", {
+ out <- Table$create(starwars) %>%
+ filter(species == "Human", homeworld == "Tatooine") %>%
+ collect()
+
+ expect_equal(nrow(out), 8)
+ expect_s3_class(out, "data.frame")
+ expect_identical(unique(out$species), "Human")
+ expect_identical(unique(out$homeworld), "Tatooine")
+})
+
+```
+
+### Discussion
+
+You can use most of the dplyr verbs directly from arrow.
+
+### See also
+
+You can find examples of the various dplyr verbs in "Introduction to dplyr" -
+run `vignette("dplyr", package = "dplyr")` or view on
+the [pkgdown site](https://dplyr.tidyverse.org/articles/dplyr.html).
+
+You can see more information about using `Table$create()` to create Arrow Tables
+and `collect()` to view them as R data frames in [Creating Arrow Objects](creating-arrow-objects.html#creating-arrow-objects).
+
+## Use R functions in dplyr verbs in arrow
+
+You want to use an R function inside a dplyr verb in arrow.
+
+### Solution
+
+```{r, dplyr_str_detect}
+Table$create(starwars) %>%
+ filter(str_detect(name, "Darth")) %>%
+ collect()
+```
+
+```{r, test_dplyr_str_detect, opts.label = "test"}
+
+test_that("dplyr_str_detect", {
+ out <- Table$create(starwars) %>%
+ filter(str_detect(name, "Darth")) %>%
+ collect()
+
+ expect_equal(nrow(out), 2)
+ expect_equal(sort(out$name), c("Darth Maul", "Darth Vader"))
+
+})
+
+```
+
+### Discussion
+
+The arrow package allows you to use dplyr verbs containing expressions which
+include base R and many tidyverse functions, but call Arrow functions under the hood.
+If you find any base R or tidyverse functions which you would like to see a
+mapping of in arrow, please
+[open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
+
+If you try to call a function which does not have an arrow mapping, the data will
+be pulled back into R, and you will see a warning message.
+
+```{r, dplyr_func_warning}
+library(stringr)
+
+Table$create(starwars) %>%
+ mutate(name_split = str_split_fixed(name, " ", 2)) %>%
+ collect()
+```
+
+```{r, test_dplyr_func_warning, opts.label = "test"}
+
+test_that("dplyr_func_warning", {
+
+ expect_warning(
+ Table$create(starwars) %>%
+ mutate(name_split = str_split_fixed(name, " ", 2)) %>%
+ collect(),
+ 'Expression str_split_fixed(name, " ", 2) not supported in Arrow; pulling data into R',
+ fixed = TRUE
+ )
+
+})
+```
+## Use arrow functions in dplyr verbs in arrow
+
+You want to use a function which is implemented in Arrow's C++ library but either:
+* it doesn't have a mapping to a base R or tidyverse equivalent, or
+* it has a mapping but nevertheless you want to call the C++ function directly
+
+### Solution
+
+```{r, dplyr_arrow_func}
+Table$create(starwars) %>%
+ select(name) %>%
+ mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
+ collect()
+```
+```{r, test_dplyr_arrow_func, opts.label = "test"}
+
+test_that("dplyr_arrow_func", {
+ out <- Table$create(starwars) %>%
+ select(name) %>%
+ mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
+ collect()
+
+ expect_match(out$padded_name, "*****C-3PO", fixed = TRUE, all = FALSE)
+
+})
+
+```
+### Discussion
+
+The vast majority of Arrow C++ compute functions have been mapped to their
+base R or tidyverse equivalents, and we strongly recommend that you use
+these mappings where possible, as the original functions are well documented
+and the mapped versions have been tested to ensure the results returned are as
+expected.
+
+However, there may be circumstances in which you might want to use a compute
+function from the Arrow C++ library which does not have a base R or tidyverse
+equivalent.
+
+You can find documentation of Arrow C++ compute functions in
+[the C++ documentation](https://arrow.apache.org/docs/cpp/compute.html#available-functions).
+This documentation lists all available compute functions, any associated options classes
+they need, and the valid data types that they can be used with.
+
+You can list all available Arrow compute functions from R by calling
+`list_compute_functions()`.
+
+```{r, list_compute_funcs}
+list_compute_functions()
+```
+```{r, test_list_compute_funcs, opts.label = "test"}
+test_that("list_compute_funcs", {
+ expect_gt(length(list_compute_functions()), 0)
+})
+```
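+
+The full list is long, so it can help to narrow it down. As a small sketch
+(assuming the `arrow` package is attached), `list_compute_functions()` accepts
+an optional `pattern` argument containing a regular expression; the functions
+returned will depend on the version of Arrow you have installed.
+
+```{r, list_compute_funcs_pattern}
+library(arrow)
+
+# Only list compute functions whose names start with "is_"
+list_compute_functions(pattern = "^is_")
+```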
+
+The majority of functions here have been mapped to their base R or tidyverse
+equivalent and can be called within a dplyr query as usual. For functions which
+don't have a base R or tidyverse equivalent, or you want to supply custom
+options, you can call them by prefixing their name with "arrow_".
+
+For example, base R's `is.na()` function is the equivalent of the Arrow C++
+compute function `is_null()` with the option `nan_is_null` set to `TRUE`.
+A mapping between these functions (with `nan_is_null` set to `TRUE`) has been
+created in arrow.
+
+```{r, dplyr_is_na}
+demo_df <- data.frame(x = c(1, 2, 3, NA, NaN))
+
+Table$create(demo_df) %>%
+ mutate(y = is.na(x)) %>%
+ collect()
+```
+
+```{r, test_dplyr_is_na, opts.label = "test"}
+test_that("dplyr_is_na", {
+ out <- Table$create(demo_df) %>%
+ mutate(y = is.na(x)) %>%
+ collect()
+
+ expect_equal(out$y, c(FALSE, FALSE, FALSE, TRUE, TRUE))
+
+})
+```
+
+If you want to call Arrow's `is_null()` function but with `nan_is_null` set to
+`FALSE` (so it returns `TRUE` when a value being examined is `NA` but `FALSE`
+when the value being examined is `NaN`), you must call it directly as
+`arrow_is_null()` and specify the option `nan_is_null = FALSE`.
+
+```{r, dplyr_arrow_is_null}
+Table$create(demo_df) %>%
+ mutate(y = arrow_is_null(x, options = list(nan_is_null = FALSE))) %>%
+ collect()
+```
+
+```{r, test_dplyr_arrow_is_null, opts.label = "test"}
+test_that("dplyr_arrow_is_null", {
+ out <- Table$create(demo_df) %>%
+ mutate(y = arrow_is_null(x, options = list(nan_is_null = FALSE))) %>%
+ collect()
+
+ expect_equal(out$y, c(FALSE, FALSE, FALSE, TRUE, FALSE))
+
+})
+```
+
+#### Compute functions with options
+
+Although not all Arrow C++ compute functions require options to be specified,
+most do. For these functions to work in R, they must be linked up
+with the appropriate libarrow options C++ class via the R
+package's C++ code. At the time of writing, all compute functions available in
+the development version of the arrow R package had been associated with their options
+classes. However, as the Arrow C++ library's functionality extends, compute
+functions may be added which do not yet have an R binding. If you find a C++
+compute function which you wish to use from the R package, please [open an issue
+on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
\ No newline at end of file