ianmcook commented on a change in pull request #1:
URL: https://github.com/apache/arrow-cookbook/pull/1#discussion_r674990064
##########
File path: r/content/reading_and_writing_data.Rmd
##########
@@ -0,0 +1,255 @@
+# Reading and Writing Data
+
+This chapter contains recipes related to reading and writing data to and from disk using Apache Arrow.
+
+## Reading and Writing Parquet Files
+
+### Writing a Parquet file
+
+You can write Parquet files to disk using `arrow::write_parquet()`.
+```{r, write_parquet}
+# Create table
+my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
+# Write to Parquet
+write_parquet(my_table, "my_table.parquet")
+```
+```{r, test_write_parquet, opts.label = "test"}
+test_that("write_parquet chunk works as expected", {
+ expect_true(file.exists("my_table.parquet"))
+})
+```
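+
+`write_parquet()` also lets you choose the compression codec via its `compression` argument. A brief sketch follows; the file name is illustrative, and which codecs are available depends on how your copy of Arrow was built.
+
+```{r, write_parquet_compressed, eval = FALSE}
+# Write the same table again, using gzip compression instead of the default
+write_parquet(my_table, "my_table_gzip.parquet", compression = "gzip")
+```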
+
+### Reading a Parquet file
+
+Given a Parquet file, it can be read back in by using `arrow::read_parquet()`.
+
+```{r, read_parquet}
+parquet_tbl <- read_parquet("my_table.parquet")
+head(parquet_tbl)
+```
+```{r, test_read_parquet, opts.label = "test"}
+test_that("read_parquet works as expected", {
+ expect_equivalent(dplyr::collect(parquet_tbl), tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
+})
+```
+
+Because the argument `as_data_frame` was left at its default value of `TRUE`, the file was read in as a `data.frame` object.
+
+```{r, read_parquet_2}
+class(parquet_tbl)
+```
+```{r, test_read_parquet_2, opts.label = "test"}
+test_that("read_parquet_2 works as expected", {
+ expect_s3_class(parquet_tbl, "data.frame")
+})
+```
+If you set `as_data_frame` to `FALSE`, the file will be read in as an Arrow Table.
+
+```{r, read_parquet_table}
+my_table_arrow_table <- read_parquet("my_table.parquet", as_data_frame = FALSE)
+head(my_table_arrow_table)
+```
+
+```{r, read_parquet_table_class}
+class(my_table_arrow_table)
+```
+```{r, test_read_parquet_table_class, opts.label = "test"}
+test_that("read_parquet_table_class works as expected", {
+ expect_s3_class(my_table_arrow_table, "Table")
+})
+```
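+
+If you later need the data as a regular R object again, an Arrow Table can be converted back with `as.data.frame()` (or `dplyr::collect()`), as sketched below.
+
+```{r, table_to_data_frame}
+# Convert the Arrow Table back into a data.frame for in-memory analysis
+my_table_df <- as.data.frame(my_table_arrow_table)
+class(my_table_df)
+```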
+
+## How to read a (partitioned) Parquet file from S3
+
+You can open a Parquet file saved on S3 by calling `read_parquet()` and passing the relevant URI as the `file` argument.
+
+```{r, read_parquet_s3, eval = FALSE}
+df <- read_parquet(file = "s3://ursa-labs-taxi-data/2019/06/data.parquet")
+```
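+
+The heading above also mentions partitioned data. For a Parquet dataset split across many files, such as the same public taxi data with one file per month, one option is `open_dataset()` together with dplyr verbs, sketched below. The `partitioning` value and the column names used here are assumptions about that bucket's layout, so treat this as illustrative rather than definitive.
+
+```{r, read_parquet_s3_dataset, eval = FALSE}
+library(dplyr)
+# Open the 2019 directory as a single multi-file Dataset; "month" names the
+# partition directories (01, 02, ...) found beneath it
+ds <- open_dataset("s3://ursa-labs-taxi-data/2019", partitioning = "month")
+
+# Only the matching rows and columns are read into memory when collect() runs
+ds %>%
+  filter(total_amount > 100) %>%
+  select(passenger_count, total_amount) %>%
+  collect()
+```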
+For more in-depth instructions, including how to work with S3 buckets that require authentication, see the guide to reading and writing data to and from S3 buckets: https://arrow.apache.org/docs/r/articles/fs.html.
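+
+If your bucket is not public, one possible approach, sketched below, is to construct the filesystem explicitly with `s3_bucket()`; the bucket name, object path, and the environment variables used for credentials are all placeholders.
+
+```{r, read_parquet_s3_private, eval = FALSE}
+# Point at a private bucket, passing credentials explicitly (they could also
+# be picked up from your standard AWS configuration)
+private_bucket <- s3_bucket(
+  "my-private-bucket",
+  access_key = Sys.getenv("AWS_ACCESS_KEY_ID"),
+  secret_key = Sys.getenv("AWS_SECRET_ACCESS_KEY")
+)
+
+# Read a file relative to that bucket
+df <- read_parquet(private_bucket$path("path/to/data.parquet"))
+```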
+
+## How to filter rows or columns while reading a Parquet file
+
+When reading in a Parquet file, you can specify which columns to read in via the `col_select` argument.
+
+```{r, read_parquet_filter}
+# Create table to read back in
+dist_time <- Table$create(tibble::tibble(distance = c(12.2, 15.7, 14.2), time = c(43, 44, 40)))
+# Write to Parquet
+write_parquet(dist_time, "dist_time.parquet")
+
+# Read in only the "time" column
+time_only <- read_parquet("dist_time.parquet", col_select = "time")
+head(time_only)
+```
+```{r, test_read_parquet_filter, opts.label = "test"}
+test_that("read_parquet_filter works as expected", {
+ expect_identical(time_only, tibble::tibble(time = c(43, 44, 40)))
+})
+```
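+
+The `col_select` argument only narrows the columns. To also filter out rows before they are loaded into memory, one option is the Dataset API, sketched below on the assumption that your version of arrow supports dplyr verbs on Datasets; the directory name is illustrative.
+
+```{r, read_parquet_filter_rows, eval = FALSE}
+library(dplyr)
+
+# Write the example table into a directory so it can be opened as a Dataset
+dir.create("dist_time_dataset", showWarnings = FALSE)
+write_parquet(dist_time, "dist_time_dataset/part-0.parquet")
+
+# Row filters on a Dataset are pushed down, so only matching rows are read
+open_dataset("dist_time_dataset") %>%
+  filter(time < 44) %>%
+  collect()
+```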
+
+## Reading and Writing CSV files
+
+You can use `write_csv_arrow()` to save a `data.frame` or an Arrow Table to disk as a CSV. The example below writes out the built-in `cars` data frame.
+
+```{r, write_csv_arrow}
+write_csv_arrow(cars, "cars.csv")
+```
+```{r, test_write_csv_arrow, opts.label = "test"}
+test_that("write_csv_arrow chunk works as expected", {
+ expect_true(file.exists("cars.csv"))
+})
+```
+
+You can use `read_csv_arrow()` to read in a CSV file as an Arrow Table.
+
+```{r, read_csv_arrow}
+my_csv <- read_csv_arrow("cars.csv")
Review comment:
```suggestion
my_csv <- read_csv_arrow("cars.csv", as_data_frame = FALSE)
```