[GitHub] [arrow-cookbook] westonpace commented on a change in pull request #78: ARROW-13732: [Doc][Cookbook] Manipulating and analyze Arrow data with dplyr verbs - R

GitBox Fri, 01 Oct 2021 12:20:18 -0700


westonpace commented on a change in pull request #78:
URL: https://github.com/apache/arrow-cookbook/pull/78#discussion_r720453431




##########
File path: r/content/arrays.Rmd
##########
@@ -0,0 +1,167 @@
+# Manipulating Data - Arrays
+
+__What you should know before you begin__
+
+An Arrow Array is roughly equivalent to an R vector - it can be used to 
+represent a single column of data, with all values having the same data type.  
+
+A number of base R functions which have S3 generic methods have been implemented
+to work on Arrow Arrays; for example `mean`, `min`, and `max`.  
+
+## Computing Mean/Min/Max, etc value of an Array
+
+You want to calculate the mean, minimum, or maximum of values in an array.
+
+### Solution
+
+```{r, array_mean_na}
+my_values <- Array$create(c(1:5, NA))
+mean(my_values, na.rm = TRUE)
+```
+```{r, test_array_mean_na, opts.label = "test"}
+test_that("array_mean_na works as expected", {
+  expect_equal(mean(my_values, na.rm = TRUE), Scalar$create(3))
+})
+```
+
+### Discussion
+
+Many base R generic functions such as `mean()`, `min()`, and `max()` have been
+mapped to their Arrow equivalents, and so can be called on Arrow Array objects 
+in the same way. They will return Arrow objects themselves.
+
+If you want to use an R function which does not have an Arrow mapping, you can 
+use `as.vector()` to convert Arrow objects to base R vectors.
+
+```{r, array_fivenum}
+arrow_array <- Array$create(1:100)
+# get Tukey's five-number summary
+fivenum(as.vector(arrow_array))
+```
+```{r, test_array_fivenum, opts.label = "test"}
+
+test_that("array_fivenum works as expected", {
+  
+  # generates both an error and a warning
+  expect_warning(
+    expect_error(fivenum(arrow_array))  
+  )
+  
+  expect_identical(
+    fivenum(as.vector(arrow_array)),
+    c(1, 25.5, 50.5, 75.5, 100)
+  )
+})
+
+```
+
+You can tell if a functions is a standard S3 generic function by looking 

Review comment:
       ```suggestion
   You can tell if a function is a standard S3 generic function by looking 
   ```

##########
File path: r/content/_bookdown.yml
##########
@@ -4,10 +4,5 @@ new_session: FALSE
 clean: ["_book/*"]
 output_dir: _book
 edit: https://github.com/apache/arrow-cookbook/edit/main/r/content/%s
-rmd_files: [
-  "index.Rmd",
-  "reading_and_writing_data.Rmd",
-  "creating_arrow_objects.Rmd",
-  "specify_data_types_and_schemas.Rmd",
-  "manipulating_data.Rmd"
-]
+
+rmd_files: ["index.Rmd", "reading_and_writing_data.Rmd", "creating_arrow_objects.Rmd", "specify_data_types_and_schemas.Rmd", "arrays.Rmd", "tables.Rmd"]

Review comment:
       This seems to be a rather long line.  Any reason for the restyling?

##########
File path: r/content/tables.Rmd
##########
@@ -0,0 +1,251 @@
+# Manipulating Data - Tables
+
+__What you should know before you begin__
+
+When you call dplyr verbs from Arrow, behind the scenes this generates 
+instructions which tell Arrow how to manipulate the data in the way you've 
+specified.  These instructions are called _expressions_.  Until you pull the 
+data back into R, expressions don't do any work to actually retrieve or 
+manipulate any data. This is known as _lazy evaluation_ and means that you can 
+build up complex expressions that perform multiple actions, and are efficiently
+evaluated all at once when you retrieve the data.  It also means that you are 
+able to manipulate data that is larger than
+you can fit into memory on the machine you're running your code on, if you only
+pull data into R when you have selected the desired subset. 

Review comment:
       We can probably keep the language as is, but our "manipulate data that 
is larger than you can fit in memory" story is broken for 5.0.0.  It gets 
better in 6.0.0 as long as you don't use `arrange()`; if you do use 
`arrange()`, we will still load the entire dataset into memory.
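
If it helps, here is a rough sketch of a pattern that sidesteps the `arrange()` problem: filter and select in Arrow, call `collect()`, and only then sort in R. The temp directory is a throwaway created just for illustration, and I'm assuming arrow 6.0.0 or later; I haven't measured memory use for this.

```r
library(arrow)
library(dplyr)

# Hypothetical dataset location, written only so the sketch is self-contained
ds_path <- tempfile("starwars_ds")
starwars %>%
  select(name, species, homeworld, height) %>%
  write_dataset(ds_path, format = "parquet")

small <- open_dataset(ds_path) %>%
  filter(species == "Human") %>%  # evaluated by Arrow
  select(name, height) %>%        # evaluated by Arrow
  collect() %>%                   # only the filtered subset is returned to R
  arrange(height)                 # sorting happens in R, on the small result
```

Sorting after `collect()` only makes sense when the filtered result comfortably fits in memory, which is the situation the chapter text describes.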

##########
File path: r/content/tables.Rmd
##########
@@ -0,0 +1,251 @@
+# Manipulating Data - Tables
+
+__What you should know before you begin__
+
+When you call dplyr verbs from Arrow, behind the scenes this generates 
+instructions which tell Arrow how to manipulate the data in the way you've 
+specified.  These instructions are called _expressions_.  Until you pull the 
+data back into R, expressions don't do any work to actually retrieve or 
+manipulate any data. This is known as _lazy evaluation_ and means that you can 
+build up complex expressions that perform multiple actions, and are efficiently
+evaluated all at once when you retrieve the data.  It also means that you are 
+able to manipulate data that is larger than
+you can fit into memory on the machine you're running your code on, if you only
+pull data into R when you have selected the desired subset. 
+
+You can also have data which is split across multiple files.  For example, you
+might have files which are stored in multiple Parquet or Feather files, 
+partitioned across different directories.  You can open multi-file datasets 
+using `open_dataset()` as discussed in a previous chapter, and then manipulate 
+this data using arrow before even reading any of it into R.
+

Review comment:
       Another bonus that you might call out more clearly here is that some 
file formats support metadata-based filtering (also called pushdown predicates 
or pushdown filtering).  This means that a dplyr query with a filter might not 
even read all of the data from the disk.  For example, in your 
`species == "Human", homeworld == "Tatooine"` example it's possible that all 
`homeworld == "Tatooine"` files are in the `/dataset/homeworld=Tatooine/` 
directory.  Or it's possible that your Parquet files have row groups without 
any Tatooine data in them, and we can detect and skip those row groups with a 
statistics filter.
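
A small sketch of the partition case, in case it's useful for the chapter. The temp directory and the choice of `homeworld` as the partitioning column are just for illustration; whether files are actually skipped depends on the arrow version and the file format, so treat this as a pattern rather than a guarantee.

```r
library(arrow)
library(dplyr)

# Hypothetical dataset location, used only for this sketch
ds_path <- tempfile("starwars_by_homeworld")

# Write one directory per homeworld (rows with a missing homeworld are
# dropped to keep the example simple)
starwars %>%
  select(name, species, homeworld) %>%
  filter(!is.na(homeworld)) %>%
  write_dataset(ds_path, format = "parquet", partitioning = "homeworld")

# Hive-style directories such as homeworld=Tatooine/ now exist on disk
list.files(ds_path)

# The homeworld filter can be answered from the directory names alone,
# so only the matching partition needs to be scanned
tatooine_humans <- open_dataset(ds_path) %>%
  filter(species == "Human", homeworld == "Tatooine") %>%
  collect()
```

The `species` filter still has to look at file contents, which is where Parquet row-group statistics could come into play.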

##########
File path: r/content/tables.Rmd
##########
@@ -0,0 +1,251 @@
+# Manipulating Data - Tables
+
+__What you should know before you begin__
+
+When you call dplyr verbs from Arrow, behind the scenes this generates 
+instructions which tell Arrow how to manipulate the data in the way you've 
+specified.  These instructions are called _expressions_.  Until you pull the 
+data back into R, expressions don't do any work to actually retrieve or 
+manipulate any data. This is known as _lazy evaluation_ and means that you can 
+build up complex expressions that perform multiple actions, and are efficiently
+evaluated all at once when you retrieve the data.  It also means that you are 
+able to manipulate data that is larger than
+you can fit into memory on the machine you're running your code on, if you only
+pull data into R when you have selected the desired subset. 
+
+You can also have data which is split across multiple files.  For example, you
+might have files which are stored in multiple Parquet or Feather files, 
+partitioned across different directories.  You can open multi-file datasets 
+using `open_dataset()` as discussed in a previous chapter, and then manipulate 
+this data using arrow before even reading any of it into R.
+
+## Using dplyr verbs in arrow
+
+You want to use a dplyr verb in arrow.
+
+### Solution
+
+```{r, dplyr_verb}
+library(dplyr)
+starwars %>%
+  Table$create() %>%
+  filter(species == "Human", homeworld == "Tatooine") %>%
+  collect()
+```
+
+```{r, test_dplyr_verb, opts.label = "test"}
+
+test_that("dplyr_verb works as expected", {
+  out <- starwars %>%
+  Table$create() %>%
+  filter(species == "Human", homeworld == "Tatooine") %>%
+  collect()
+
+  expect_equal(nrow(out), 8)
+  expect_s3_class(out, "data.frame")
+  expect_identical(unique(out$species), "Human")
+  expect_identical(unique(out$homeworld), "Tatooine")
+})
+
+```
+
+### Discussion
+
+You can use most of the dplyr verbs directly from arrow.  
+
+### See also
+
+You can find examples of the various dplyr verbs in "Introduction to dplyr" - 
+run `vignette("dplyr", package = "dplyr")` or view on
+the [pkgdown site](https://dplyr.tidyverse.org/articles/dplyr.html).
+
+You can see more information about using `Table$create()` to create Arrow Tables
+and `collect()` to view them as R data frames in [Creating Arrow Objects](creating-arrow-objects.html#creating-arrow-objects).
+
+## Using base R or tidyverse functions in dplyr verbs in arrow
+
+You want to use a tidyverse function or base R function in arrow.
+
+### Solution
+
+```{r, dplyr_str_detect}
+starwars %>%
+  Table$create() %>%
+  filter(str_detect(name, "Darth")) %>%
+  collect()
+```
+
+```{r, test_dplyr_str_detect, opts.label = "test"}
+
+test_that("dplyr_str_detect", {
+  out <- starwars %>%
+    Table$create() %>%
+    filter(str_detect(name, "Darth")) %>%
+    collect()
+  
+  expect_equal(nrow(out), 2)
+  expect_equal(sort(out$name), c("Darth Maul", "Darth Vader"))
+  
+})
+
+```
+
+### Discussion
+
+The arrow package allows you to use dplyr verbs containing expressions which 
+include base R and tidyverse functions, but call Arrow functions under the hood.
+If you find any base R or tidyverse functions which you would like to see a 
+mapping of in arrow, please 
+[open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
+
+If you try to call a function which does not have arrow mapping, the data will 
+be pulled back into R, and you will see a warning message.
+
+
+```{r, dplyr_func_warning}
+library(stringr)
+starwars %>%
+  Table$create() %>%
+  mutate(name_split = str_split_fixed(name, " ", 2)) %>%
+  collect()
+```
+
+```{r, test_dplyr_func_warning, opts.label = "test"}
+
+test_that("dplyr_func_warning", {
+  
+  expect_warning(
+     starwars %>%
+      Table$create() %>%
+      mutate(name_split = str_split_fixed(name, " ", 2)) %>%
+      collect(),
+    'Expression str_split_fixed(name, " ", 2) not supported in Arrow; pulling data into R',
+    fixed = TRUE
+  )
+
+})
+```
+## Using arrow functions in dplyr verbs in arrow
+
+You want to use a function which is implemented in Arrow's C++ library but either:
+* it doesn't have a mapping to a base R or tidyverse equivalent, or 
+* it has a mapping but nevertheless you want to call the C++ function directly
+
+### Solution
+
+```{r, dplyr_arrow_func}
+starwars %>%
+  Table$create() %>%
+  select(name) %>%
+  mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
+  collect()
+```
+```{r, test_dplyr_arrow_func, opts.label = "test"}
+
+test_that("dplyr_arrow_func", {
+  out <- starwars %>%
+    Table$create() %>%
+    select(name) %>%
+    mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
+    collect()
+  
+  expect_match(out$padded_name, "*****C-3PO", fixed = TRUE, all = FALSE)
+  
+})
+
+```
+### Discussion
+
+Arrow C++ compute functions have been mapped to their 
+base R or tidyverse equivalents where possible, and we strongly recommend that you use
+these mappings where possible, as the original functions are well documented

Review comment:
       Nit: possibly uncomfortable duplicate of `where possible`.

##########
File path: r/content/tables.Rmd
##########
@@ -0,0 +1,251 @@
+# Manipulating Data - Tables
+
+__What you should know before you begin__
+
+When you call dplyr verbs from Arrow, behind the scenes this generates 
+instructions which tell Arrow how to manipulate the data in the way you've 
+specified.  These instructions are called _expressions_.  Until you pull the 
+data back into R, expressions don't do any work to actually retrieve or 
+manipulate any data. This is known as _lazy evaluation_ and means that you can 
+build up complex expressions that perform multiple actions, and are efficiently
+evaluated all at once when you retrieve the data.  It also means that you are 
+able to manipulate data that is larger than
+you can fit into memory on the machine you're running your code on, if you only
+pull data into R when you have selected the desired subset. 
+
+You can also have data which is split across multiple files.  For example, you
+might have files which are stored in multiple Parquet or Feather files, 
+partitioned across different directories.  You can open multi-file datasets 
+using `open_dataset()` as discussed in a previous chapter, and then manipulate 
+this data using arrow before even reading any of it into R.
+
+## Using dplyr verbs in arrow
+
+You want to use a dplyr verb in arrow.
+
+### Solution
+
+```{r, dplyr_verb}
+library(dplyr)
+starwars %>%
+  Table$create() %>%
+  filter(species == "Human", homeworld == "Tatooine") %>%
+  collect()
+```
+
+```{r, test_dplyr_verb, opts.label = "test"}
+
+test_that("dplyr_verb works as expected", {
+  out <- starwars %>%
+  Table$create() %>%
+  filter(species == "Human", homeworld == "Tatooine") %>%
+  collect()
+
+  expect_equal(nrow(out), 8)
+  expect_s3_class(out, "data.frame")
+  expect_identical(unique(out$species), "Human")
+  expect_identical(unique(out$homeworld), "Tatooine")
+})
+
+```
+
+### Discussion
+
+You can use most of the dplyr verbs directly from arrow.  
+
+### See also
+
+You can find examples of the various dplyr verbs in "Introduction to dplyr" - 
+run `vignette("dplyr", package = "dplyr")` or view on
+the [pkgdown site](https://dplyr.tidyverse.org/articles/dplyr.html).
+
+You can see more information about using `Table$create()` to create Arrow Tables
+and `collect()` to view them as R data frames in [Creating Arrow Objects](creating-arrow-objects.html#creating-arrow-objects).
+
+## Using base R or tidyverse functions in dplyr verbs in arrow
+
+You want to use a tidyverse function or base R function in arrow.
+
+### Solution
+
+```{r, dplyr_str_detect}
+starwars %>%
+  Table$create() %>%
+  filter(str_detect(name, "Darth")) %>%
+  collect()
+```
+
+```{r, test_dplyr_str_detect, opts.label = "test"}
+
+test_that("dplyr_str_detect", {
+  out <- starwars %>%
+    Table$create() %>%
+    filter(str_detect(name, "Darth")) %>%
+    collect()
+  
+  expect_equal(nrow(out), 2)
+  expect_equal(sort(out$name), c("Darth Maul", "Darth Vader"))
+  
+})
+
+```
+
+### Discussion
+
+The arrow package allows you to use dplyr verbs containing expressions which 
+include base R and tidyverse functions, but call Arrow functions under the hood.
+If you find any base R or tidyverse functions which you would like to see a 
+mapping of in arrow, please 
+[open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
+
+If you try to call a function which does not have arrow mapping, the data will 
+be pulled back into R, and you will see a warning message.
+
+
+```{r, dplyr_func_warning}
+library(stringr)
+starwars %>%
+  Table$create() %>%
+  mutate(name_split = str_split_fixed(name, " ", 2)) %>%
+  collect()
+```
+
+```{r, test_dplyr_func_warning, opts.label = "test"}
+
+test_that("dplyr_func_warning", {
+  
+  expect_warning(
+     starwars %>%
+      Table$create() %>%
+      mutate(name_split = str_split_fixed(name, " ", 2)) %>%
+      collect(),
+    'Expression str_split_fixed(name, " ", 2) not supported in Arrow; pulling data into R',
+    fixed = TRUE
+  )
+
+})
+```
+## Using arrow functions in dplyr verbs in arrow
+
+You want to use a function which is implemented in Arrow's C++ library but either:
+* it doesn't have a mapping to a base R or tidyverse equivalent, or 
+* it has a mapping but nevertheless you want to call the C++ function directly
+
+### Solution
+
+```{r, dplyr_arrow_func}
+starwars %>%
+  Table$create() %>%
+  select(name) %>%
+  mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
+  collect()
+```
+```{r, test_dplyr_arrow_func, opts.label = "test"}
+
+test_that("dplyr_arrow_func", {
+  out <- starwars %>%
+    Table$create() %>%
+    select(name) %>%
+    mutate(padded_name = arrow_ascii_lpad(name, options = list(width = 10, padding = "*"))) %>%
+    collect()
+  
+  expect_match(out$padded_name, "*****C-3PO", fixed = TRUE, all = FALSE)
+  
+})
+
+```
+### Discussion
+
+Arrow C++ compute functions have been mapped to their 
+base R or tidyverse equivalents where possible, and we strongly recommend that you use
+these mappings where possible, as the original functions are well documented
+and the mapped versions have been tested to ensure the results returned are as 
+expected.
+
+However, there may be circumstances in which you might want to use a compute 
+function from the Arrow C++ library which does not have a base R or tidyverse 
+equivalent.
+
+You can find documentation of Arrow C++ compute functions in 
+[the C++ documentation](https://arrow.apache.org/docs/cpp/compute.html#available-functions).
+This documentation lists all available compute functions, any associated options classes
+they need, and the valid data types that they can be used with.
+
+You can list all available Arrow compute functions from R by calling 
+`list_compute_functions()`.
+
+```{r, list_compute_funcs}
+list_compute_functions()
+```
+```{r, test_list_compute_funcs, opts.label = "test"}
+test_that("list_compute_funcs", {
+  expect_gt(length(list_compute_functions()), 0)
+})
+```
+
+The majority of functions here have been mapped to their base R or tidyverse 
+equivalent and can be called within a dplyr query as usual.  For functions which
+don't have a base R or tidyverse equivalent, or you want to supply custom 
+options, you can call them by prefixing their name with "arrow_".  
+
+For example, base R's `is.na()` function is the equivalent of the Arrow C++ 
+compute function `is_null()` with the option `nan_is_null` set to `TRUE`.  
+A mapping between these functions (with `nan_is_null` set to `TRUE`) has been
+created in arrow.
+
+```{r, dplyr_is_na}
+demo_df <- data.frame(x = c(1, 2, 3, NA, NaN))
+
+demo_df %>%
+  Table$create() %>%
+  mutate(y = is.na(x)) %>% 
+  collect()
+```
+
+```{r, test_dplyr_is_na, opts.label = "test"}
+test_that("dplyr_is_na", {
+  out <- demo_df %>%
+  Table$create() %>%
+  mutate(y = is.na(x)) %>% 
+  collect()
+  
+  expect_equal(out$y, c(FALSE, FALSE, FALSE, TRUE, TRUE))
+  
+})
+```
+
+If you want to call Arrow's `is_null()` function but with `nan_is_null` set to 
+`FALSE` (so it returns `TRUE` when a value being examined is `NA` but `FALSE` 
+when the value being examined is `NaN`), you must call `is_null()` directly and
+specify the option `nan_is_null = FALSE`.
+
+```{r, dplyr_arrow_is_null}
+demo_df %>%
+  Table$create() %>%
+  mutate(y = arrow_is_null(x, options  = list(nan_is_null = FALSE))) %>% 
+  collect()
+```
+
+```{r, test_dplyr_arrow_is_null, opts.label = "test"}
+test_that("dplyr_arrow_is_null", {
+  out <- demo_df %>%
+    Table$create() %>%
+    mutate(y = arrow_is_null(x, options  = list(nan_is_null = FALSE))) %>% 
+    collect()
+  
+  expect_equal(out$y, c(FALSE, FALSE, FALSE, TRUE, FALSE))
+  
+})
+```
+
+#### Compute functions with options
+
+Although not all Arrow C++ compute functions require options to be specified, 
+most do, and for these functions to work in R, the function must be associated 
+with the appropriate options C++ class in the R 
+package's C++ code.  At the time of writing, all compute functions available in
+the development version of the arrow R package had been associated with their options
+classes.  However, as the Arrow C++ library's functionality extends, compute 
+functions may be added which do not yet have an R binding.  If you find a C++ 
+compute function which you wish to use from the R package, please [open an issue
+on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).

Review comment:
       As a user, I don't really know what "the function must be associated 
with the appropriate options C++ class in the R  package's C++ code" means.
   
   I feel like the user is going to kind of gloss over this section.  What 
would a missing options error even look like?

##########
File path: r/content/arrays.Rmd
##########
@@ -0,0 +1,167 @@
+# Manipulating Data - Arrays
+
+__What you should know before you begin__
+
+An Arrow Array is roughly equivalent to an R vector - it can be used to 
+represent a single column of data, with all values having the same data type.  
+
+A number of base R functions which have S3 generic methods have been implemented
+to work on Arrow Arrays; for example `mean`, `min`, and `max`.  
+
+## Computing Mean/Min/Max, etc value of an Array
+
+You want to calculate the mean, minimum, or maximum of values in an array.
+
+### Solution
+
+```{r, array_mean_na}
+my_values <- Array$create(c(1:5, NA))
+mean(my_values, na.rm = TRUE)
+```
+```{r, test_array_mean_na, opts.label = "test"}
+test_that("array_mean_na works as expected", {
+  expect_equal(mean(my_values, na.rm = TRUE), Scalar$create(3))
+})
+```
+
+### Discussion
+
+Many base R generic functions such as `mean()`, `min()`, and `max()` have been
+mapped to their Arrow equivalents, and so can be called on Arrow Array objects 
+in the same way. They will return Arrow objects themselves.
+
+If you want to use an R function which does not have an Arrow mapping, you can 
+use `as.vector()` to convert Arrow objects to base R vectors.
+
+```{r, array_fivenum}
+arrow_array <- Array$create(1:100)
+# get Tukey's five-number summary
+fivenum(as.vector(arrow_array))
+```
+```{r, test_array_fivenum, opts.label = "test"}
+
+test_that("array_fivenum works as expected", {
+  
+  # generates both an error and a warning
+  expect_warning(
+    expect_error(fivenum(arrow_array))  
+  )
+  
+  expect_identical(
+    fivenum(as.vector(arrow_array)),
+    c(1, 25.5, 50.5, 75.5, 100)
+  )
+})
+
+```
+
+You can tell if a functions is a standard S3 generic function by looking 
+at the body of the function - S3 generic functions call `UseMethod()`
+to determine the appropriate version of that function to use for the object.
+
+```{r}
+mean
+```
+
+You can use `isS3stdGeneric()` to determine if a function is an S3 generic.

Review comment:
       ```suggestion
   You can also use `isS3stdGeneric()` to determine if a function is an S3 generic.
   ```

##########
File path: r/content/arrays.Rmd
##########
@@ -0,0 +1,167 @@
+# Manipulating Data - Arrays
+
+__What you should know before you begin__
+
+An Arrow Array is roughly equivalent to an R vector - it can be used to 
+represent a single column of data, with all values having the same data type.  
+
+A number of base R functions which have S3 generic methods have been implemented
+to work on Arrow Arrays; for example `mean`, `min`, and `max`.  
+
+## Computing Mean/Min/Max, etc value of an Array
+
+You want to calculate the mean, minimum, or maximum of values in an array.
+
+### Solution
+
+```{r, array_mean_na}
+my_values <- Array$create(c(1:5, NA))
+mean(my_values, na.rm = TRUE)
+```
+```{r, test_array_mean_na, opts.label = "test"}
+test_that("array_mean_na works as expected", {
+  expect_equal(mean(my_values, na.rm = TRUE), Scalar$create(3))
+})
+```
+
+### Discussion
+
+Many base R generic functions such as `mean()`, `min()`, and `max()` have been
+mapped to their Arrow equivalents, and so can be called on Arrow Array objects 
+in the same way. They will return Arrow objects themselves.
+
+If you want to use an R function which does not have an Arrow mapping, you can 
+use `as.vector()` to convert Arrow objects to base R vectors.
+
+```{r, array_fivenum}
+arrow_array <- Array$create(1:100)
+# get Tukey's five-number summary
+fivenum(as.vector(arrow_array))
+```
+```{r, test_array_fivenum, opts.label = "test"}
+
+test_that("array_fivenum works as expected", {
+  
+  # generates both an error and a warning
+  expect_warning(
+    expect_error(fivenum(arrow_array))  
+  )
+  
+  expect_identical(
+    fivenum(as.vector(arrow_array)),
+    c(1, 25.5, 50.5, 75.5, 100)
+  )
+})
+
+```
+
+You can tell if a functions is a standard S3 generic function by looking 
+at the body of the function - S3 generic functions call `UseMethod()`
+to determine the appropriate version of that function to use for the object.
+
+```{r}
+mean
+```
+
+You can use `isS3stdGeneric()` to determine if a function is an S3 generic.
+
+```{r}
+isS3stdGeneric("mean")
+```
+
+If you find an S3 generic function which isn't implemented for Arrow objects 
+but you would like to be able to use, please 
+[open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
+
+## Counting occurrences of elements in an Array
+
+You want to count repeated values in an Array.
+
+### Solution
+
+```{r, value_counts}
+repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3))
+value_counts(repeated_vals)
+```
+
+```{r, test_value_counts, opts.label = "test"}
+test_that("value_counts works as expected", {
+  expect_equal(
+    as.vector(value_counts(repeated_vals)),
+    tibble(
+      values = as.numeric(names(table(as.vector(repeated_vals)))),
+      counts = as.vector(table(as.vector(repeated_vals)))
+    )
+  )
+})
+```
+
+### Discussion
+
+Some functions in the Arrow R package do not have base R equivalents. In other 
+cases, the base R equivalents are not generic functions so they cannot be called
+directly on Arrow Array objects.
+
+For example, the `value_count()` function in the Arrow R package is loosely 

Review comment:
       ```suggestion
   For example, the `value_counts()` function in the Arrow R package is loosely 
   ```




