thisisnic commented on a change in pull request #78: URL: https://github.com/apache/arrow-cookbook/pull/78#discussion_r720686494
########## File path: r/content/tables.Rmd ########## @@ -0,0 +1,251 @@ +# Manipulating Data - Tables + +__What you should know before you begin__ + +When you call dplyr verbs from Arrow, behind the scenes this generates +instructions which tell Arrow how to manipulate the data in the way you've +specified. These instructions are called _expressions_. Until you pull the +data back into R, expressions don't do any work to actually retrieve or +manipulate any data. This is known as _lazy evaluation_ and means that you can +build up complex expressions that perform multiple actions, and are efficiently +evaluated all at once when you retrieve the data. It also means that you are +able to manipulate data that is larger than +you can fit into memory on the machine you're running your code on, if you only +pull data into R when you have selected the desired subset. + +You can also have data which is split across multiple files. For example, you +might have files which are stored in multiple Parquet or Feather files, +partitioned across different directories. You can open multi-file datasets +using `open_dataset()` as discussed in a previous chapter, and then manipulate +this data using arrow before even reading any of it into R. + +## Using dplyr verbs in arrow + +You want to use a dplyr verb in arrow. + +### Solution + +```{r, dplyr_verb} +library(dplyr) +starwars %>% + Table$create() %>% + filter(species == "Human", homeworld == "Tatooine") %>% + collect() +``` + +```{r, test_dplyr_verb, opts.label = "test"} + +test_that("dplyr_verb works as expected", { + out <- starwars %>% + Table$create() %>% + filter(species == "Human", homeworld == "Tatooine") %>% + collect() + + expect_equal(nrow(out), 8) + expect_s3_class(out, "data.frame") + expect_identical(unique(out$species), "Human") + expect_identical(unique(out$homeworld), "Tatooine") +}) + +``` + +### Discussion + +You can use most of the dplyr verbs directly from arrow. + +### See also + +You can find examples of the various dplyr verbs in "Introduction to dplyr" - +run `vignette("dplyr", package = "dplyr")` or view on +the [pkgdown site](https://dplyr.tidyverse.org/articles/dplyr.html). + +You can see more information about using `Table$create()` to create Arrow Tables +and `collect()` to view them as R data frames in [Creating Arrow Objects](creating-arrow-objects.html#creating-arrow-objects). + +## Using base R or tidyverse functions in dplyr verbs in arrow + +You want to use a tidyverse function or base R function in arrow. + +### Solution + +```{r, dplyr_str_detect} +starwars %>% + Table$create() %>% + filter(str_detect(name, "Darth")) %>% + collect() +``` + +```{r, test_dplyr_str_detect, opts.label = "test"} + +test_that("dplyr_str_detect", { + out <- starwars %>% + Table$create() %>% + filter(str_detect(name, "Darth")) %>% + collect() + + expect_equal(nrow(out), 2) + expect_equal(sort(out$name), c("Darth Maul", "Darth Vader")) + +}) + +``` + +### Discussion + +The arrow package allows you to use dplyr verbs containing expressions which +include base R and tidyverse functions, but call Arrow functions under the hood. +If you find any base R or tidyverse functions which you would like to see a +mapping of in arrow, please +[open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues). + +If you try to call a function which does not have arrow mapping, the data will +be pulled back into R, and you will see a warning message. + + +```{r, dplyr_func_warning} +library(stringr) +starwars %>% + Table$create() %>% + mutate(name_split = str_split_fixed(name, " ", 2)) %>% + collect() +``` + +```{r, test_dplyr_func_warning, opts.label = "test"} + +test_that("dplyr_func_warning", { + + expect_warning( + starwars %>% + Table$create() %>% + mutate(name_split = str_split_fixed(name, " ", 2)) %>% + collect(), + 'Expression str_split_fixed(name, " ", 2) not supported in Arrow; pulling data into R', + fixed = TRUE + ) + +}) +``` +## Using arrow functions in dplyr verbs in arrow + +You want to use a function which is implemented in Arrow's C++ library but either: +* it doesn't have a mapping to a base R or tidyverse equivalent, or +* it has a mapping but nevertheless you want to call the C++ function directly + +### Solution + +```{r, dplyr_arrow_func} +starwars %>% + Table$create() %>% + select(name) %>% + mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>% + collect() +``` +```{r, test_dplyr_arrow_func, opts.label = "test"} + +test_that("dplyr_arrow_func", { + out <- starwars %>% + Table$create() %>% + select(name) %>% + mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>% + collect() + + expect_match(out$padded_name, "*****C-3PO", fixed = TRUE, all = FALSE) + +}) + +``` +### Discussion + +Arrow C++ compute functions have been mapped to their +base R or tidyverse equivalents where possible, and we strongly recommend that you use +these mappings where possible, as the original functions are well documented +and the mapped versions have been tested to ensure the results returned are as +expected. + +However, there may be circumstances in which you might want to use a compute +function from the Arrow C++ library which does not have a base R or tidyverse +equivalent. + +You can find documentation of Arrow C++ compute functions in +[the C++ documention](https://arrow.apache.org/docs/cpp/compute.html#available-functions). +This documentation lists all available compute functions, any associated options classes +they need, and the valid data types that they can be used with. + +You can list all available Arrow compute functions from R by calling +`list_compute_functions()`. + +```{r, list_compute_funcs} +list_compute_functions() +``` +```{r, test_list_compute_funcs, opts.label = "test"} +test_that("list_compute_funcs", { + expect_gt(length(list_compute_functions()), 0) +}) +``` + +The majority of functions here have been mapped to their base R or tidyverse +equivalent and can be called within a dplyr query as usual. For functions which +don't have a base R or tidyverse equivalent, or you want to supply custom +options, you can call them by prefixing their name with "arrow_". + +For example, base R's `is.na()` function is the equivalent of the Arrow C++ +compute function `is_null()` with the option `nan_is_null` set to `TRUE`. +A mapping between these functions (with `nan_is_null` set to `TRUE`) has been +created in arrow. + +```{r, dplyr_is_na} +demo_df <- data.frame(x = c(1, 2, 3, NA, NaN)) + +demo_df %>% + Table$create() %>% + mutate(y = is.na(x)) %>% + collect() +``` + +```{r, test_dplyr_is_na, opts.label = "test"} +test_that("dplyr_is_na", { + out <- demo_df %>% + Table$create() %>% + mutate(y = is.na(x)) %>% + collect() + + expect_equal(out$y, c(FALSE, FALSE, FALSE, TRUE, TRUE)) + +}) +``` + +If you want to call Arrow's `is_null()` function but with `nan_is_null` set to +`FALSE` (so it returns `TRUE` when a value being examined is `NA` but `FALSE` +when the value being examined is `NaN`), you must call `is_null()` directly and +specify the option `nan_is_null = FALSE`. + +```{r, dplyr_arrow_is_null} +demo_df %>% + Table$create() %>% + mutate(y = arrow_is_null(x, options = list(nan_is_null = FALSE))) %>% + collect() +``` + +```{r, test_dplyr_arrow_is_null, opts.label = "test"} +test_that("dplyr_arrow_is_null", { + out <- demo_df %>% + Table$create() %>% + mutate(y = arrow_is_null(x, options = list(nan_is_null = FALSE))) %>% + collect() + + expect_equal(out$y, c(FALSE, FALSE, FALSE, TRUE, FALSE)) + +}) +``` + +#### Compute functions with options + +Although not all Arrow C++ compute functions require options to be specified, +most do, and for these functions to work in R, the function must be associated +with the appropriate options C++ class in the R +package's C++ code. At the time of writing, all compute functions available in +the development version of the arrow R package had been associated with their options +classes. However, as the Arrow C++ library's functionality extends, compute +functions may be added which do not yet have an R binding. If you find a C++ +compute function which you wish to use from the R package, please [open an issue +on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues). Review comment: Have rephrased to clarify. I don't mind if the reader glosses over this, as this section is only relevant to a very small niche of users. The current error messaging isn't quite there, so I opened this ticket - [ARROW-14067](https://issues.apache.org/jira/browse/ARROW-14067) - to improve the error messaging if options are not bound or not specified by the user. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org