thisisnic commented on a change in pull request #78:
URL: https://github.com/apache/arrow-cookbook/pull/78#discussion_r720685461
##########
File path: r/content/tables.Rmd
##########
@@ -0,0 +1,251 @@
+# Manipulating Data - Tables
+
+__What you should know before you begin__
+
+When you call dplyr verbs from Arrow, behind the scenes this generates
+instructions which tell Arrow how to manipulate the data in the way you've
+specified. These instructions are called _expressions_. Until you pull the
+data back into R, expressions don't do any work to actually retrieve or
+manipulate any data. This is known as _lazy evaluation_ and means that you can
+build up complex expressions that perform multiple actions, and are efficiently
+evaluated all at once when you retrieve the data. It also means that you are
+able to manipulate data that is larger than
+you can fit into memory on the machine you're running your code on, if you only
+pull data into R when you have selected the desired subset.
+
+You can also have data which is split across multiple files. For example, you
+might have files which are stored in multiple Parquet or Feather files,
+partitioned across different directories. You can open multi-file datasets
+using `open_dataset()` as discussed in a previous chapter, and then manipulate
+this data using arrow before even reading any of it into R.
+
+## Using dplyr verbs in arrow
+
+You want to use a dplyr verb in arrow.
+
+### Solution
+
+```{r, dplyr_verb}
+library(dplyr)
+starwars %>%
+  Table$create() %>%
+  filter(species == "Human", homeworld == "Tatooine") %>%
+  collect()
+```
+
+```{r, test_dplyr_verb, opts.label = "test"}
+
+test_that("dplyr_verb works as expected", {
+  out <- starwars %>%
+    Table$create() %>%
+    filter(species == "Human", homeworld == "Tatooine") %>%
+    collect()
+
+  expect_equal(nrow(out), 8)
+  expect_s3_class(out, "data.frame")
+  expect_identical(unique(out$species), "Human")
+  expect_identical(unique(out$homeworld), "Tatooine")
+})
+
+```
+
+### Discussion
+
+You can use most of the dplyr verbs directly from arrow.
+
+### See also
+
+You can find examples of the various dplyr verbs in "Introduction to dplyr" -
+run `vignette("dplyr", package = "dplyr")` or view on
+the [pkgdown site](https://dplyr.tidyverse.org/articles/dplyr.html).
+
+You can see more information about using `Table$create()` to create Arrow Tables
+and `collect()` to view them as R data frames in [Creating Arrow Objects](creating-arrow-objects.html#creating-arrow-objects).
+
+## Using base R or tidyverse functions in dplyr verbs in arrow
+
+You want to use a tidyverse function or base R function in arrow.
+
+### Solution
+
+```{r, dplyr_str_detect}
+starwars %>%
+  Table$create() %>%
+  filter(str_detect(name, "Darth")) %>%
+  collect()
+```
+
+```{r, test_dplyr_str_detect, opts.label = "test"}
+
+test_that("dplyr_str_detect", {
+  out <- starwars %>%
+    Table$create() %>%
+    filter(str_detect(name, "Darth")) %>%
+    collect()
+
+  expect_equal(nrow(out), 2)
+  expect_equal(sort(out$name), c("Darth Maul", "Darth Vader"))
+
+})
+
+```
+
+### Discussion
+
+The arrow package allows you to use dplyr verbs containing expressions which
+include base R and tidyverse functions, but call Arrow functions under the hood.
+If you find any base R or tidyverse functions which you would like to see a
+mapping of in arrow, please
+[open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
+
+If you try to call a function which does not have arrow mapping, the data will
+be pulled back into R, and you will see a warning message.
+
+
+```{r, dplyr_func_warning}
+library(stringr)
+starwars %>%
+  Table$create() %>%
+  mutate(name_split = str_split_fixed(name, " ", 2)) %>%
+  collect()
+```
+
+```{r, test_dplyr_func_warning, opts.label = "test"}
+
+test_that("dplyr_func_warning", {
+
+  expect_warning(
+    starwars %>%
+      Table$create() %>%
+      mutate(name_split = str_split_fixed(name, " ", 2)) %>%
+      collect(),
+    'Expression str_split_fixed(name, " ", 2) not supported in Arrow; pulling data into R',
+    fixed = TRUE
+  )
+
+})
+```
+## Using arrow functions in dplyr verbs in arrow
+
+You want to use a function which is implemented in Arrow's C++ library but either:
+* it doesn't have a mapping to a base R or tidyverse equivalent, or
+* it has a mapping but nevertheless you want to call the C++ function directly
+
+### Solution
+
+```{r, dplyr_arrow_func}
+starwars %>%
+  Table$create() %>%
+  select(name) %>%
+  mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
+  collect()
+```
+```{r, test_dplyr_arrow_func, opts.label = "test"}
+
+test_that("dplyr_arrow_func", {
+  out <- starwars %>%
+    Table$create() %>%
+    select(name) %>%
+    mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
+    collect()
+
+  expect_match(out$padded_name, "*****C-3PO", fixed = TRUE, all = FALSE)
+
+})
+
+```
+### Discussion
+
+Arrow C++ compute functions have been mapped to their
+base R or tidyverse equivalents where possible, and we strongly recommend that you use
+these mappings where possible, as the original functions are well documented

Review comment:
Ew, yeah, will rephrase; good spot!