westonpace commented on a change in pull request #1:
URL: https://github.com/apache/arrow-cookbook/pull/1#discussion_r674934317



##########
File path: r/content/manipulating_data.Rmd
##########
@@ -0,0 +1,66 @@
+# Manipulating Data
+
+## Computing Mean/Min/Max, etc. values of an array
+
+Many base R functions such as `mean()`, `min()`, and `max()` have been mapped to their Arrow equivalents, and so can be called on Arrow Array objects in the same way.  They will return Arrow objects themselves.
+
+```{r, array_mean_na}
+my_values <- Array$create(c(1:5, NA))
+mean(my_values, na.rm = TRUE)
+```
+```{r, test_array_mean_na, opts.label = "test"}
+test_that("array_mean_na works as expected", {
+  expect_equal(mean(my_values, na.rm = TRUE), Scalar$create(3))
+})
+```
+If you want to use an R function which does not have an Arrow mapping, you can use `as.vector()` to convert Arrow objects to base R vectors.
+
+```{r, fivenum}
+fivenum(as.vector(my_values))
+```
+```{r, test_fivenum, opts.label = "test"}
+test_that("fivenum works as expected", {
+  expect_equal(fivenum(as.vector(my_values)), 1:5)
+})
+```
+
+## Counting occurrences of elements in an array
+
+You can use `value_counts()` to count elements in an Array.
+
+```{r, value_counts}
+repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3))
+value_counts(repeated_vals)
+```
+```{r, test_value_counts, opts.label = "test"}
+test_that("value_counts works as expected", {
+  expect_equal(as.vector(value_counts(repeated_vals)), tibble::tibble(values = c(1, 2, 3), counts = c(2L, 1L, 5L)))
+})
+```
+
+## Applying arithmetic functions to arrays
+
+You can use the various arithmetic operators on Array objects.
+
+```{r, add_array}
+num_array <- Array$create(1:10)
+num_array + 10
+```
+```{r, test_add_array, opts.label = "test"}
+test_that("add_array works as expected", {
+  # need to specify expected array as 1:10 as 10 instead of 11:20 so is double not integer
+  expect_equal(num_array + 10, Array$create(1:10 + 10))
+})
+```
+
+You'll get the same result if you pass in the value you're adding as an Arrow object.

Review comment:
       ```suggestion
   You will get the same result if you pass in the value you're adding as an Arrow object.
   ```
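
For reference, the arithmetic example this sentence introduces appears later in the file; a minimal sketch of it, using `num_array` from the chunk above:

```r
library(arrow)

num_array <- Array$create(1:10)

# Adding an Arrow Scalar gives the same result as adding a plain R value
num_array + Scalar$create(10)
```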

##########
File path: r/content/reading_and_writing_data.Rmd
##########
@@ -0,0 +1,255 @@
+# Reading and Writing Data
+
+This chapter contains recipes related to reading and writing data from disk using Apache Arrow.
+
+## Reading and Writing Parquet Files
+
+### Writing a Parquet file
+
+You can write Parquet files to disk using `arrow::write_parquet()`.
+```{r, write_parquet}
+# Create table
+my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
+# Write to Parquet
+write_parquet(my_table, "my_table.parquet")
+```
+```{r, test_write_parquet, opts.label = "test"}
+test_that("write_parquet chunk works as expected", {
+  expect_true(file.exists("my_table.parquet"))
+})
+```
+ 
+### Reading a Parquet file
+
+Given a Parquet file, it can be read back to an Arrow Table by using `arrow::read_parquet()`.
+
+```{r, read_parquet}
+parquet_tbl <- read_parquet("my_table.parquet")
+head(parquet_tbl)
+```
+```{r, test_read_parquet, opts.label = "test"}
+test_that("read_parquet works as expected", {
+  expect_equivalent(dplyr::collect(parquet_tbl), tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
+})
+```
+
+If the argument `as_data_frame` was set to `TRUE` (the default), the file was read in as a `data.frame` object.
+
+```{r, read_parquet_2}
+class(parquet_tbl)
+```
+```{r, test_read_parquet_2, opts.label = "test"}
+test_that("read_parquet_2 works as expected", {
+  expect_s3_class(parquet_tbl, "data.frame")
+})
+```
+If you set `as_data_frame` to `FALSE`, the file will be read in as an Arrow Table.
+
+```{r, read_parquet_table}
+my_table_arrow_table <- read_parquet("my_table.parquet", as_data_frame = FALSE)
+head(my_table_arrow_table)
+```
+
+```{r, read_parquet_table_class}
+class(my_table_arrow_table)
+```
+```{r, test_read_parquet_table_class, opts.label = "test"}
+test_that("read_parquet_table_class works as expected", {
+  expect_s3_class(my_table_arrow_table, "Table")
+})
+```
+
+## How to read a (partitioned) Parquet file from S3 
+
+You can open a Parquet file saved on S3 by calling `read_parquet()` and passing the relevant URI as the `file` argument.
+
+```{r, read_parquet_s3, eval = FALSE}
+df <- read_parquet(file = "s3://ursa-labs-taxi-data/2019/06/data.parquet")
+```
+For more in-depth instructions, including how to work with S3 buckets which require authentication, you can find a guide to reading and writing to/from S3 buckets here: https://arrow.apache.org/docs/r/articles/fs.html.
+
+## How to filter rows or columns while reading a Parquet file 

Review comment:
       ```suggestion
   ## How to filter columns while reading a Parquet file 
   ```
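
For the authenticated case mentioned in the S3 recipe above, a minimal sketch (the bucket name is a placeholder, and the environment variable names are just one common convention, not part of the recipe):

```r
library(arrow)

# Connect to a private bucket with explicit credentials
bucket <- s3_bucket(
  "my-private-bucket",  # hypothetical bucket name
  access_key = Sys.getenv("AWS_ACCESS_KEY_ID"),
  secret_key = Sys.getenv("AWS_SECRET_ACCESS_KEY")
)

# Read a Parquet file relative to the bucket root
df <- read_parquet(bucket$path("path/to/data.parquet"))
```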

##########
File path: r/content/manipulating_data.Rmd
##########
@@ -0,0 +1,66 @@
+# Manipulating Data
+
+## Computing Mean/Min/Max, etc. values of an array
+
+Many base R functions such as `mean()`, `min()`, and `max()` have been mapped to their Arrow equivalents, and so can be called on Arrow Array objects in the same way.  They will return Arrow objects themselves.
+
+```{r, array_mean_na}
+my_values <- Array$create(c(1:5, NA))
+mean(my_values, na.rm = TRUE)
+```
+```{r, test_array_mean_na, opts.label = "test"}
+test_that("array_mean_na works as expected", {
+  expect_equal(mean(my_values, na.rm = TRUE), Scalar$create(3))
+})
+```
+If you want to use an R function which does not have an Arrow mapping, you can use `as.vector()` to convert Arrow objects to base R vectors.
+
+```{r, fivenum}
+fivenum(as.vector(my_values))
+```
+```{r, test_fivenum, opts.label = "test"}
+test_that("fivenum works as expected", {
+  expect_equal(fivenum(as.vector(my_values)), 1:5)
+})
+```
+
+## Counting occurrences of elements in an array
+
+You can use `value_counts()` to count elements in an Array.
+
+```{r, value_counts}
+repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3))
+value_counts(repeated_vals)
+```
+```{r, test_value_counts, opts.label = "test"}
+test_that("value_counts works as expected", {
+  expect_equal(as.vector(value_counts(repeated_vals)), tibble::tibble(values = c(1, 2, 3), counts = c(2L, 1L, 5L)))
+})
+```
+
+## Applying arithmetic functions to arrays
+
+You can use the various arithmetic operators on Array objects.
+
+```{r, add_array}
+num_array <- Array$create(1:10)
+num_array + 10
+```
+```{r, test_add_array, opts.label = "test"}
+test_that("add_array works as expected", {
+  # need to specify expected array as 1:10 as 10 instead of 11:20 so is double not integer

Review comment:
       ```suggestion
  # need to specify expected array as 1:10 + 10 instead of 11:20 so is double not integer
   ```
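
The integer-versus-double distinction the comment is getting at can be seen directly from the inferred Array types; a quick illustration (standard R and arrow type-inference behavior):

```r
library(arrow)

# 11:20 is an R integer vector, so Arrow infers int32
Array$create(11:20)$type

# 1:10 + 10 is an R double vector, so Arrow infers float64
Array$create(1:10 + 10)$type
```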

##########
File path: r/content/reading_and_writing_data.Rmd
##########
@@ -0,0 +1,255 @@
+# Reading and Writing Data
+
+This chapter contains recipes related to reading and writing data from disk using Apache Arrow.

Review comment:
       The other chapters don't have a "This chapter contains..." header.  Maybe have it on all or none?

##########
File path: r/content/reading_and_writing_data.Rmd
##########
@@ -0,0 +1,255 @@
+# Reading and Writing Data
+
+This chapter contains recipes related to reading and writing data from disk using Apache Arrow.
+
+## Reading and Writing Parquet Files
+
+### Writing a Parquet file
+
+You can write Parquet files to disk using `arrow::write_parquet()`.
+```{r, write_parquet}
+# Create table
+my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
+# Write to Parquet
+write_parquet(my_table, "my_table.parquet")
+```
+```{r, test_write_parquet, opts.label = "test"}
+test_that("write_parquet chunk works as expected", {
+  expect_true(file.exists("my_table.parquet"))
+})
+```
+ 
+### Reading a Parquet file
+
+Given a Parquet file, it can be read back to an Arrow Table by using `arrow::read_parquet()`.
+
+```{r, read_parquet}
+parquet_tbl <- read_parquet("my_table.parquet")
+head(parquet_tbl)
+```
+```{r, test_read_parquet, opts.label = "test"}
+test_that("read_parquet works as expected", {
+  expect_equivalent(dplyr::collect(parquet_tbl), tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
+})
+```
+
+If the argument `as_data_frame` was set to `TRUE` (the default), the file was read in as a `data.frame` object.
+
+```{r, read_parquet_2}
+class(parquet_tbl)
+```
+```{r, test_read_parquet_2, opts.label = "test"}
+test_that("read_parquet_2 works as expected", {
+  expect_s3_class(parquet_tbl, "data.frame")
+})
+```
+If you set `as_data_frame` to `FALSE`, the file will be read in as an Arrow Table.
+
+```{r, read_parquet_table}
+my_table_arrow_table <- read_parquet("my_table.parquet", as_data_frame = FALSE)
+head(my_table_arrow_table)
+```
+
+```{r, read_parquet_table_class}
+class(my_table_arrow_table)
+```
+```{r, test_read_parquet_table_class, opts.label = "test"}
+test_that("read_parquet_table_class works as expected", {
+  expect_s3_class(my_table_arrow_table, "Table")
+})
+```
+
+## How to read a (partitioned) Parquet file from S3 

Review comment:
       ```suggestion
   ## How to read a Parquet file from S3 
   ```
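
Since the heading originally mentioned the partitioned case, it may be worth noting that a directory of partitioned Parquet files on S3 is typically opened with `open_dataset()` rather than `read_parquet()`; a hedged sketch against the same public bucket:

```r
library(arrow)

# Open a directory of partitioned Parquet files on S3 as a Dataset
# (the URI points at a directory rather than a single file)
ds <- open_dataset("s3://ursa-labs-taxi-data/2019")
```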

##########
File path: r/content/reading_and_writing_data.Rmd
##########
@@ -0,0 +1,255 @@
+# Reading and Writing Data
+
+This chapter contains recipes related to reading and writing data from disk using Apache Arrow.
+
+## Reading and Writing Parquet Files
+
+### Writing a Parquet file
+
+You can write Parquet files to disk using `arrow::write_parquet()`.
+```{r, write_parquet}
+# Create table
+my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
+# Write to Parquet
+write_parquet(my_table, "my_table.parquet")
+```
+```{r, test_write_parquet, opts.label = "test"}
+test_that("write_parquet chunk works as expected", {
+  expect_true(file.exists("my_table.parquet"))
+})
+```
+ 
+### Reading a Parquet file
+
+Given a Parquet file, it can be read back to an Arrow Table by using `arrow::read_parquet()`.
+
+```{r, read_parquet}
+parquet_tbl <- read_parquet("my_table.parquet")
+head(parquet_tbl)
+```
+```{r, test_read_parquet, opts.label = "test"}
+test_that("read_parquet works as expected", {
+  expect_equivalent(dplyr::collect(parquet_tbl), tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
+})
+```
+
+If the argument `as_data_frame` was set to `TRUE` (the default), the file was read in as a `data.frame` object.
+
+```{r, read_parquet_2}
+class(parquet_tbl)
+```
+```{r, test_read_parquet_2, opts.label = "test"}
+test_that("read_parquet_2 works as expected", {
+  expect_s3_class(parquet_tbl, "data.frame")
+})
+```
+If you set `as_data_frame` to `FALSE`, the file will be read in as an Arrow Table.
+
+```{r, read_parquet_table}
+my_table_arrow_table <- read_parquet("my_table.parquet", as_data_frame = FALSE)
+head(my_table_arrow_table)
+```
+
+```{r, read_parquet_table_class}
+class(my_table_arrow_table)
+```
+```{r, test_read_parquet_table_class, opts.label = "test"}
+test_that("read_parquet_table_class works as expected", {
+  expect_s3_class(my_table_arrow_table, "Table")
+})
+```
+
+## How to read a (partitioned) Parquet file from S3 
+
+You can open a Parquet file saved on S3 by calling `read_parquet()` and passing the relevant URI as the `file` argument.
+
+```{r, read_parquet_s3, eval = FALSE}
+df <- read_parquet(file = "s3://ursa-labs-taxi-data/2019/06/data.parquet")
+```
+For more in-depth instructions, including how to work with S3 buckets which require authentication, you can find a guide to reading and writing to/from S3 buckets here: https://arrow.apache.org/docs/r/articles/fs.html.
+
+## How to filter rows or columns while reading a Parquet file 
+
+When reading in a Parquet file, you can specify which columns to read in via the `col_select` argument.
+
+```{r, read_parquet_filter}
+# Create table to read back in 
+dist_time <- Table$create(tibble::tibble(distance = c(12.2, 15.7, 14.2), time = c(43, 44, 40)))
+# Write to Parquet
+write_parquet(dist_time, "dist_time.parquet")
+
+# Read in only the "time" column
+time_only <- read_parquet("dist_time.parquet", col_select = "time")
+head(time_only)
+```
+```{r, test_read_parquet_filter, opts.label = "test"}
+test_that("read_parquet_filter works as expected", {
+  expect_identical(time_only, tibble::tibble(time = c(43, 44, 40)))
+})
+```
+
+## Reading and Writing CSV files 
+
+You can use `write_csv_arrow()` to save an Arrow Table to disk as a CSV.
+
+```{r, write_csv_arrow}
+write_csv_arrow(cars, "cars.csv")
+```
+```{r, test_write_csv_arrow, opts.label = "test"}
+test_that("write_csv_arrow chunk works as expected", {
+  expect_true(file.exists("cars.csv"))
+})
+```
+
+You can use `read_csv_arrow()` to read in a CSV file as an Arrow Table.
+
+```{r, read_csv_arrow}
+my_csv <- read_csv_arrow("cars.csv")
+```
+
+```{r, test_read_csv_arrow, opts.label = "test"}
+test_that("read_csv_arrow chunk works as expected", {
+  expect_equivalent(dplyr::collect(my_csv), cars)
+})
+```
+
+## Reading and Writing Partitioned Data 
+
+### Writing Partitioned Data
+
+You can use `write_dataset()` to save data to disk in partitions based on columns in the data.
+
+```{r, write_dataset}
+write_dataset(airquality, "airquality_partitioned", partitioning = c("Month", "Day"))
+list.files("airquality_partitioned")
+```
+```{r, test_write_dataset, opts.label = "test"}
+test_that("write_dataset chunk works as expected", {
+  # Partition by month
+  expect_identical(list.files("airquality_partitioned"), c("Month=5", "Month=6", "Month=7", "Month=8", "Month=9"))
+  # We have enough files
+  expect_equal(length(list.files("airquality_partitioned", recursive = TRUE)), 153)
+})
+```
+As you can see, this has created folders based on the first partition variable supplied, `Month`.
+
+If you take a look in one of these folders, you will see that the data is then partitioned by the second partition variable, `Day`.
+
+```{r}
+list.files("airquality_partitioned/Month=5")
+```
+
+Each of these folders contains 1 or more Parquet files containing the relevant partition of the data.
+
+```{r}
+list.files("airquality_partitioned/Month=5/Day=10")
+```
+
+### Reading Partitioned Data
+
+You can use `open_dataset()` to read partitioned data.
+
+```{r, open_dataset}
+# Write some partitioned data to disk to read back in
+write_dataset(airquality, "airquality_partitioned", partitioning = c("Month", "Day"))
+
+# Read data from directory
+air_data <- open_dataset("airquality_partitioned")
+
+# View data
+air_data
+```
+```{r, test_open_dataset, opts.label = "test"}
+test_that("open_dataset chunk works as expected", {
+  expect_equal(nrow(air_data), 153)
+  expect_equal(arrange(collect(air_data), Month, Day), arrange(airquality, Month, Day), ignore_attr = TRUE)
+})
+```
+
+## Reading and Writing Feather files 

Review comment:
       This seems out of order...
   
   * Read parquet files # single file
   * Read datasets # multiple files
   * Read IPC files # single file
   
   Maybe it would be more natural to have
   
   * Read parquet files # single file
   * Read IPC files # single file
   * Read datasets # multiple files
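
On the IPC/Feather single-file point, a minimal read-back sketch to pair with the `write_feather()` recipe below (assuming the `my_table.arrow` file written there; by default `read_feather()` returns a data.frame):

```r
library(arrow)

# Read an IPC/Feather V2 file back in;
# as_data_frame = FALSE would give an Arrow Table instead
my_table_df <- read_feather("my_table.arrow")
```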
   
   

##########
File path: r/content/manipulating_data.Rmd
##########
@@ -0,0 +1,66 @@
+# Manipulating Data
+
+## Computing Mean/Min/Max, etc. values of an array
+
+Many base R functions such as `mean()`, `min()`, and `max()` have been mapped to their Arrow equivalents, and so can be called on Arrow Array objects in the same way.  They will return Arrow objects themselves.
+
+```{r, array_mean_na}
+my_values <- Array$create(c(1:5, NA))
+mean(my_values, na.rm = TRUE)
+```
+```{r, test_array_mean_na, opts.label = "test"}
+test_that("array_mean_na works as expected", {
+  expect_equal(mean(my_values, na.rm = TRUE), Scalar$create(3))
+})
+```
+If you want to use an R function which does not have an Arrow mapping, you can use `as.vector()` to convert Arrow objects to base R vectors.
+
+```{r, fivenum}
+fivenum(as.vector(my_values))
+```
+```{r, test_fivenum, opts.label = "test"}
+test_that("fivenum works as expected", {
+  expect_equal(fivenum(as.vector(my_values)), 1:5)
+})
+```
+
+## Counting occurrences of elements in an array
+
+You can use `value_counts()` to count elements in an Array.
+
+```{r, value_counts}
+repeated_vals <- Array$create(c(1, 1, 2, 3, 3, 3, 3, 3))
+value_counts(repeated_vals)
+```
+```{r, test_value_counts, opts.label = "test"}
+test_that("value_counts works as expected", {
+  expect_equal(as.vector(value_counts(repeated_vals)), tibble::tibble(values = c(1, 2, 3), counts = c(2L, 1L, 5L)))
+})
+```
+
+## Applying arithmetic functions to arrays
+
+You can use the various arithmetic operators on Array objects.
+
+```{r, add_array}
+num_array <- Array$create(1:10)
+num_array + 10
+```
+```{r, test_add_array, opts.label = "test"}
+test_that("add_array works as expected", {
+  # need to specify expected array as 1:10 as 10 instead of 11:20 so is double not integer
+  expect_equal(num_array + 10, Array$create(1:10 + 10))
+})
+```
+
+You'll get the same result if you pass in the value you're adding as an Arrow object.
+
+```{r, add_array_scalar}
+num_array + Scalar$create(10)
+```
+```{r, test_add_array_scalar, opts.label = "test"}
+test_that("add_array_scalar works as expected", {
+  # need to specify expected array as 1:10 as 10 instead of 11:20 so is double not integer

Review comment:
       ```suggestion
  # need to specify expected array as 1:10 + 10 instead of 11:20 so is double not integer
   ```

##########
File path: r/content/reading_and_writing_data.Rmd
##########
@@ -0,0 +1,255 @@
+# Reading and Writing Data
+
+This chapter contains recipes related to reading and writing data from disk using Apache Arrow.
+
+## Reading and Writing Parquet Files
+
+### Writing a Parquet file
+
+You can write Parquet files to disk using `arrow::write_parquet()`.
+```{r, write_parquet}
+# Create table
+my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
+# Write to Parquet
+write_parquet(my_table, "my_table.parquet")
+```
+```{r, test_write_parquet, opts.label = "test"}
+test_that("write_parquet chunk works as expected", {
+  expect_true(file.exists("my_table.parquet"))
+})
+```
+ 
+### Reading a Parquet file
+
+Given a Parquet file, it can be read back to an Arrow Table by using `arrow::read_parquet()`.
+
+```{r, read_parquet}
+parquet_tbl <- read_parquet("my_table.parquet")
+head(parquet_tbl)
+```
+```{r, test_read_parquet, opts.label = "test"}
+test_that("read_parquet works as expected", {
+  expect_equivalent(dplyr::collect(parquet_tbl), tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
+})
+```
+
+If the argument `as_data_frame` was set to `TRUE` (the default), the file was read in as a `data.frame` object.
+
+```{r, read_parquet_2}
+class(parquet_tbl)
+```
+```{r, test_read_parquet_2, opts.label = "test"}
+test_that("read_parquet_2 works as expected", {
+  expect_s3_class(parquet_tbl, "data.frame")
+})
+```
+If you set `as_data_frame` to `FALSE`, the file will be read in as an Arrow Table.
+
+```{r, read_parquet_table}
+my_table_arrow_table <- read_parquet("my_table.parquet", as_data_frame = FALSE)
+head(my_table_arrow_table)
+```
+
+```{r, read_parquet_table_class}
+class(my_table_arrow_table)
+```
+```{r, test_read_parquet_table_class, opts.label = "test"}
+test_that("read_parquet_table_class works as expected", {
+  expect_s3_class(my_table_arrow_table, "Table")
+})
+```
+
+## How to read a (partitioned) Parquet file from S3 
+
+You can open a Parquet file saved on S3 by calling `read_parquet()` and passing the relevant URI as the `file` argument.
+
+```{r, read_parquet_s3, eval = FALSE}
+df <- read_parquet(file = "s3://ursa-labs-taxi-data/2019/06/data.parquet")
+```
+For more in-depth instructions, including how to work with S3 buckets which require authentication, you can find a guide to reading and writing to/from S3 buckets here: https://arrow.apache.org/docs/r/articles/fs.html.
+
+## How to filter rows or columns while reading a Parquet file 
+
+When reading in a Parquet file, you can specify which columns to read in via the `col_select` argument.
+
+```{r, read_parquet_filter}
+# Create table to read back in 
+dist_time <- Table$create(tibble::tibble(distance = c(12.2, 15.7, 14.2), time = c(43, 44, 40)))
+# Write to Parquet
+write_parquet(dist_time, "dist_time.parquet")
+
+# Read in only the "time" column
+time_only <- read_parquet("dist_time.parquet", col_select = "time")
+head(time_only)
+```
+```{r, test_read_parquet_filter, opts.label = "test"}
+test_that("read_parquet_filter works as expected", {
+  expect_identical(time_only, tibble::tibble(time = c(43, 44, 40)))
+})
+```
+
+## Reading and Writing CSV files 
+
+You can use `write_csv_arrow()` to save an Arrow Table to disk as a CSV.
+
+```{r, write_csv_arrow}
+write_csv_arrow(cars, "cars.csv")
+```
+```{r, test_write_csv_arrow, opts.label = "test"}
+test_that("write_csv_arrow chunk works as expected", {
+  expect_true(file.exists("cars.csv"))
+})
+```
+
+You can use `read_csv_arrow()` to read in a CSV file as an Arrow Table.
+
+```{r, read_csv_arrow}
+my_csv <- read_csv_arrow("cars.csv")
+```
+
+```{r, test_read_csv_arrow, opts.label = "test"}
+test_that("read_csv_arrow chunk works as expected", {
+  expect_equivalent(dplyr::collect(my_csv), cars)
+})
+```
+
+## Reading and Writing Partitioned Data 
+
+### Writing Partitioned Data
+
+You can use `write_dataset()` to save data to disk in partitions based on columns in the data.
+
+```{r, write_dataset}
+write_dataset(airquality, "airquality_partitioned", partitioning = c("Month", "Day"))
+list.files("airquality_partitioned")
+```
+```{r, test_write_dataset, opts.label = "test"}
+test_that("write_dataset chunk works as expected", {
+  # Partition by month
+  expect_identical(list.files("airquality_partitioned"), c("Month=5", "Month=6", "Month=7", "Month=8", "Month=9"))
+  # We have enough files
+  expect_equal(length(list.files("airquality_partitioned", recursive = TRUE)), 153)
+})
+```
+As you can see, this has created folders based on the first partition variable supplied, `Month`.
+
+If you take a look in one of these folders, you will see that the data is then partitioned by the second partition variable, `Day`.
+
+```{r}
+list.files("airquality_partitioned/Month=5")
+```
+
+Each of these folders contains 1 or more Parquet files containing the relevant partition of the data.
+
+```{r}
+list.files("airquality_partitioned/Month=5/Day=10")
+```
+
+### Reading Partitioned Data
+
+You can use `open_dataset()` to read partitioned data.
+
+```{r, open_dataset}
+# Write some partitioned data to disk to read back in
+write_dataset(airquality, "airquality_partitioned", partitioning = c("Month", "Day"))
+
+# Read data from directory
+air_data <- open_dataset("airquality_partitioned")
+
+# View data
+air_data
+```
+```{r, test_open_dataset, opts.label = "test"}
+test_that("open_dataset chunk works as expected", {
+  expect_equal(nrow(air_data), 153)
+  expect_equal(arrange(collect(air_data), Month, Day), arrange(airquality, Month, Day), ignore_attr = TRUE)
+})
+```
+
+## Reading and Writing Feather files 
+
+### Write an IPC/Feather V2 file
+
+The Arrow IPC file format is identical to the Feather version 2 format.  If you call `write_arrow()`, you will get a warning telling you to use `write_feather()` instead.
+
+```{r, write_arrow}
+# Create table
+my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
+write_arrow(my_table, "my_table.arrow")
+```
+```{r, test_write_arrow, opts.label = "test"}
+test_that("write_arrow chunk works as expected", {
+  expect_true(file.exists("my_table.arrow"))
+  expect_warning(
+    write_arrow(iris, "my_table.arrow"),
+    regexp = "Use 'write_ipc_stream' or 'write_feather' instead."
+  )
+})
+```
+
+Instead, you can use `write_feather()`.
+
+```{r, write_feather}
+my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
+write_feather(my_table, "my_table.arrow")
+```
+```{r, test_write_feather, opts.label = "test"}
+test_that("write_feather chunk works as expected", {
+  expect_true(file.exists("my_table.arrow"))
+})
+```
+### Write a Feather (version 1) file
+
+You can write data in the original Feather format by setting the `version` parameter to `1`.

Review comment:
       ```suggestion
   For legacy support, you can write data in the original Feather format by setting the `version` parameter to `1`.
   ```
   I'm a little torn on recipes demonstrating deprecated behavior
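
If the recipe is kept, a minimal sketch of the legacy write itself (reusing `my_table` from the chunk above; `version = 1` selects the original Feather format, and `"my_table_v1.feather"` is an illustrative file name):

```r
library(arrow)

my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))

# Write in the legacy Feather V1 format for older readers
write_feather(my_table, "my_table_v1.feather", version = 1)
```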

##########
File path: r/content/reading_and_writing_data.Rmd
##########
@@ -0,0 +1,255 @@
+# Reading and Writing Data
+
+This chapter contains recipes related to reading and writing data from disk using Apache Arrow.
+
+## Reading and Writing Parquet Files
+
+### Writing a Parquet file
+
+You can write Parquet files to disk using `arrow::write_parquet()`.
+```{r, write_parquet}
+# Create table
+my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
+# Write to Parquet
+write_parquet(my_table, "my_table.parquet")
+```
+```{r, test_write_parquet, opts.label = "test"}
+test_that("write_parquet chunk works as expected", {
+  expect_true(file.exists("my_table.parquet"))
+})
+```
+ 
+### Reading a Parquet file
+
+Given a Parquet file, it can be read back to an Arrow Table by using `arrow::read_parquet()`.

Review comment:
       This recipe is a little confusing.  First you say "it can be read back to an Arrow Table" but then you explicitly point out that it is, in fact, a data.frame.  Then you say "If...the file will be read in as an Arrow table" which sort of conflicts with the first sentence.
   
   Maybe...
   
   Given a parquet file, it can be read in using `arrow::read_parquet()`.  By default this will read it in as a `data.frame` object.
   
   ...
   
   To get an Arrow Table we can set the `as_data_frame` argument to `FALSE`
   
   ...
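
Assembled from chunks earlier in the file, the suggested flow might look like this (a sketch only, not final recipe text):

```r
library(arrow)

# By default, read_parquet() returns a data.frame
parquet_tbl <- read_parquet("my_table.parquet")
class(parquet_tbl)

# Setting as_data_frame = FALSE returns an Arrow Table instead
my_table_arrow_table <- read_parquet("my_table.parquet", as_data_frame = FALSE)
class(my_table_arrow_table)
```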




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

