thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1027992996


##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)

Review Comment:
   ```suggestion
   - `read_feather()`: read a file in the Apache Arrow IPC format (formerly called the Feather format)
   ```



##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.

Review Comment:
   What do you think to the idea of mentioning in this paragraph that the file
   reader functions read data into memory? I've been in a few discussions
   lately where it's been mentioned that the eager/lazy distinction between
   the `read_*()` functions and `open_dataset()` is a more accurate one than
   single-file/multi-file (you can read a single file via the datasets API).
   Whilst we haven't started pushing that narrative yet, this could be a good
   place to start?
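
   A rough sketch of the contrast I have in mind (hedged: it assumes
   `open_dataset()` happily takes a single file path, which I believe it does):

   ```r
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)

   file_path <- tempfile()
   write_parquet(starwars, file_path)

   # eager: read_parquet() pulls the whole file into memory right away
   df <- read_parquet(file_path)

   # lazy: open_dataset() only inspects the file's metadata; rows are
   # materialised when the query is collect()ed
   ds <- open_dataset(file_path)
   ds %>%
     select(name, height) %>%
     collect()
   ```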



##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file
+
+For writing data to single files, the `arrow` package provides the
+following functions, which can be used with both R data frames and 
+Arrow Tables:
+
+- `write_parquet()`: write a file in Parquet format
+- `write_feather()`: write a file in Arrow IPC format
+- `write_csv_arrow()`: write a file in CSV format
+
+All these functions can read and write files in the local filesystem or
+to cloud storage. For more on cloud storage support in `arrow`, see the
+[cloud storage article](./fs.html).
+
+The `arrow` package also supports reading and writing multi-file datasets,
+which enable analysis and processing of larger-than-memory data, and provide 
+the ability to partition data into smaller chunks without loading the full 
+data into memory. For more information on this topic, see the
+[dataset article](./dataset.html).
+
+## Parquet format
+
+[Apache Parquet](https://parquet.apache.org/) is a popular
+choice for storing analytics data; it is a binary format that is 
+optimized for reduced file sizes and fast read performance, especially 
+for column-based access patterns. The simplest way to read and write
+Parquet data using `arrow` is with the `read_parquet()` and 
+`write_parquet()` functions. To illustrate this, we'll write the 
+`starwars` data included in `dplyr` to a Parquet file, then read it 
+back in. First load the `arrow` and `dplyr` packages:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+Next we'll write the data frame to a Parquet file located at `file_path`:
+
+```{r}
+file_path <- tempfile()
+write_parquet(starwars, file_path)
+```
+
+The size of a Parquet file is typically much smaller than the corresponding
+CSV file would have been. This is in part due to the use of file compression:
+by default, Parquet files written with the `arrow` package use
+[Snappy compression](https://google.github.io/snappy/), but other options
+such as gzip are also supported. See `help("write_parquet", package = "arrow")`
+for more information.
+
+Having written the Parquet file, we now can read it with `read_parquet()`:
+
+```{r}
+read_parquet(file_path)
+```
+
+The default is to return a data frame or tibble. If we want an Arrow Table
+instead, we would set `as_data_frame = FALSE`:
+
+```{r}
+read_parquet(file_path, as_data_frame = FALSE)
+```
+
+One useful feature of Parquet files is that they store data column-wise, and
+contain metadata that allow file readers to skip to the relevant sections of
+the file. That means it is possible to load only a subset of the columns
+without reading the complete file. The `col_select` argument to
+`read_parquet()` supports this functionality:
+
+```{r}
+read_parquet(file_path, col_select = c("name", "height", "mass"))
+```
+
+R object attributes are preserved when writing data to Parquet or
+Arrow/Feather files and when reading those files back into R. This enables
+round-trip writing and reading of `sf::sf` objects, R data frames with
+`haven::labelled` columns, and data frames with other custom
+attributes. To learn more about how metadata are handled in `arrow`, see the
+[metadata article](./metadata.html).
+
+## Arrow/Feather format
+
+The Arrow file format was developed to provide binary columnar 
+serialization for data frames, to make reading and writing data frames 
+efficient, and to make sharing data across data analysis languages easy.
+This file format is sometimes referred to as Feather because it is an
+outgrowth of the original [Feather](https://github.com/wesm/feather) project 
+that has now been moved into the Arrow project itself. You can find the 
+detailed specification of version 2 of the Arrow format -- officially 
+referred to as [the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
+on the Arrow specification page. 
+
+The `write_feather()` function writes version 2 Arrow/Feather files by
+default, and supports multiple kinds of file compression. Basic use is shown
+below:
+
+```{r}
+file_path <- tempfile()
+write_feather(starwars, file_path)
+```
+
+The `read_feather()` function provides a familiar interface for reading
+Feather files:
+
+```{r}
+read_feather(file_path)
+```
+
+Like the Parquet reader, this reader supports reading only a subset of
+columns, and can produce Arrow Table output:
+
+```{r}
+read_feather(
+  file = file_path, 
+  col_select = c("name", "height", "mass"), 
+  as_data_frame = FALSE
+)
+```
+
+## CSV format
+
+The read/write capabilities of the `arrow` package also include support for 
+CSV and other text-delimited files. The `read_csv_arrow()`, `read_tsv_arrow()`,
+and `read_delim_arrow()` functions all use the Arrow C++ CSV reader to read 
+data files, where the Arrow C++ options have been mapped to arguments in a 
+way that mirrors the conventions used in `readr::read_delim()`, with a 
+`col_select` argument inspired by `vroom::vroom()`. 
+
+Although `read_csv_arrow()` currently has fewer parsing options than other
+CSV readers available in R for dealing with every CSV format variation in
+the wild, for those files that it can read it is often significantly faster
+than other R CSV readers, such as `base::read.csv`, `readr::read_csv`, and
+`data.table::fread`.

Review Comment:
   Lol, are we going there? :P  



##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file
+
+For writing data to single files, the `arrow` package provides the
+following functions, which can be used with both R data frames and 
+Arrow Tables:
+
+- `write_parquet()`: write a file in Parquet format
+- `write_feather()`: write a file in Arrow IPC format
+- `write_csv_arrow()`: write a file in CSV format
+
+All these functions can read and write files in the local filesystem or
+to cloud storage. For more on cloud storage support in `arrow`, see the
+[cloud storage article](./fs.html).
+
+The `arrow` package also supports reading and writing multi-file datasets,
+which enable analysis and processing of larger-than-memory data, and provide 
+the ability to partition data into smaller chunks without loading the full 
+data into memory. For more information on this topic, see the
+[dataset article](./dataset.html).
+
+## Parquet format
+
+[Apache Parquet](https://parquet.apache.org/) is a popular
+choice for storing analytics data; it is a binary format that is 
+optimized for reduced file sizes and fast read performance, especially 
+for column-based access patterns. The simplest way to read and write
+Parquet data using `arrow` is with the `read_parquet()` and 
+`write_parquet()` functions. To illustrate this, we'll write the 
+`starwars` data included in `dplyr` to a Parquet file, then read it 
+back in. First load the `arrow` and `dplyr` packages:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+Next we'll write the data frame to a Parquet file located at `file_path`:
+
+```{r}
+file_path <- tempfile()
+write_parquet(starwars, file_path)
+```
+
+The size of a Parquet file is typically much smaller than the corresponding
+CSV file would have been. This is in part due to the use of file compression:
+by default, Parquet files written with the `arrow` package use
+[Snappy compression](https://google.github.io/snappy/), but other options
+such as gzip are also supported. See `help("write_parquet", package = "arrow")`
+for more information.
+
+Having written the Parquet file, we now can read it with `read_parquet()`:
+
+```{r}
+read_parquet(file_path)
+```
+
+The default is to return a data frame or tibble. If we want an Arrow Table
+instead, we would set `as_data_frame = FALSE`:
+
+```{r}
+read_parquet(file_path, as_data_frame = FALSE)
+```
+
+One useful feature of Parquet files is that they store data column-wise, and
+contain metadata that allow file readers to skip to the relevant sections of
+the file. That means it is possible to load only a subset of the columns
+without reading the complete file. The `col_select` argument to
+`read_parquet()` supports this functionality:
+
+```{r}
+read_parquet(file_path, col_select = c("name", "height", "mass"))
+```
+
+R object attributes are preserved when writing data to Parquet or
+Arrow/Feather files and when reading those files back into R. This enables
+round-trip writing and reading of `sf::sf` objects, R data frames with
+`haven::labelled` columns, and data frames with other custom
+attributes. To learn more about how metadata are handled in `arrow`, see the
+[metadata article](./metadata.html).
+
+## Arrow/Feather format
+
+The Arrow file format was developed to provide binary columnar 
+serialization for data frames, to make reading and writing data frames 
+efficient, and to make sharing data across data analysis languages easy.
+This file format is sometimes referred to as Feather because it is an
+outgrowth of the original [Feather](https://github.com/wesm/feather) project 
+that has now been moved into the Arrow project itself. You can find the 
+detailed specification of version 2 of the Arrow format -- officially 
+referred to as [the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
+on the Arrow specification page. 

Review Comment:
   I like the addition of this context explaining why it was called Feather
   and why it no longer is.



##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file
+
+For writing data to single files, the `arrow` package provides the
+following functions, which can be used with both R data frames and 
+Arrow Tables:
+
+- `write_parquet()`: write a file in Parquet format
+- `write_feather()`: write a file in Arrow IPC format
+- `write_csv_arrow()`: write a file in CSV format
+
+All these functions can read and write files in the local filesystem or
+to cloud storage. For more on cloud storage support in `arrow`, see the
+[cloud storage article](./fs.html).
+
+The `arrow` package also supports reading and writing multi-file datasets,
+which enable analysis and processing of larger-than-memory data, and provide 
+the ability to partition data into smaller chunks without loading the full 
+data into memory. For more information on this topic, see the
+[dataset article](./dataset.html).
+
+## Parquet format
+
+[Apache Parquet](https://parquet.apache.org/) is a popular
+choice for storing analytics data; it is a binary format that is 
+optimized for reduced file sizes and fast read performance, especially 
+for column-based access patterns. The simplest way to read and write
+Parquet data using `arrow` is with the `read_parquet()` and 
+`write_parquet()` functions. To illustrate this, we'll write the 
+`starwars` data included in `dplyr` to a Parquet file, then read it 
+back in. First load the `arrow` and `dplyr` packages:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+Next we'll write the data frame to a Parquet file located at `file_path`:
+
+```{r}
+file_path <- tempfile()
+write_parquet(starwars, file_path)
+```
+
+The size of a Parquet file is typically much smaller than the corresponding
+CSV file would have been. This is in part due to the use of file compression:
+by default, Parquet files written with the `arrow` package use
+[Snappy compression](https://google.github.io/snappy/), but other options
+such as gzip are also supported. See `help("write_parquet", package = "arrow")`
+for more information.
+
+Having written the Parquet file, we now can read it with `read_parquet()`:
+
+```{r}
+read_parquet(file_path)
+```
+
+The default is to return a data frame or tibble. If we want an Arrow Table
+instead, we would set `as_data_frame = FALSE`:
+
+```{r}
+read_parquet(file_path, as_data_frame = FALSE)
+```
+
+One useful feature of Parquet files is that they store data column-wise, and
+contain metadata that allow file readers to skip to the relevant sections of
+the file. That means it is possible to load only a subset of the columns
+without reading the complete file. The `col_select` argument to
+`read_parquet()` supports this functionality:
+
+```{r}
+read_parquet(file_path, col_select = c("name", "height", "mass"))
+```
+
+R object attributes are preserved when writing data to Parquet or
+Arrow/Feather files and when reading those files back into R. This enables
+round-trip writing and reading of `sf::sf` objects, R data frames with
+`haven::labelled` columns, and data frames with other custom
+attributes. To learn more about how metadata are handled in `arrow`, see the
+[metadata article](./metadata.html).
+
+## Arrow/Feather format
+
+The Arrow file format was developed to provide binary columnar 
+serialization for data frames, to make reading and writing data frames 
+efficient, and to make sharing data across data analysis languages easy.
+This file format is sometimes referred to as Feather because it is an
+outgrowth of the original [Feather](https://github.com/wesm/feather) project 
+that has now been moved into the Arrow project itself. You can find the 
+detailed specification of version 2 of the Arrow format -- officially 
+referred to as [the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
+on the Arrow specification page. 
+
+The `write_feather()` function writes version 2 Arrow/Feather files by
+default, and supports multiple kinds of file compression. Basic use is shown
+below:
+
+```{r}
+file_path <- tempfile()
+write_feather(starwars, file_path)
+```
+
+The `read_feather()` function provides a familiar interface for reading
+Feather files:
+
+```{r}
+read_feather(file_path)
+```
+
+Like the Parquet reader, this reader supports reading only a subset of
+columns, and can produce Arrow Table output:
+
+```{r}
+read_feather(
+  file = file_path, 
+  col_select = c("name", "height", "mass"), 
+  as_data_frame = FALSE
+)
+```
+
+## CSV format
+
+The read/write capabilities of the `arrow` package also include support for 
+CSV and other text-delimited files. The `read_csv_arrow()`, `read_tsv_arrow()`,
+and `read_delim_arrow()` functions all use the Arrow C++ CSV reader to read 
+data files, where the Arrow C++ options have been mapped to arguments in a 
+way that mirrors the conventions used in `readr::read_delim()`, with a 
+`col_select` argument inspired by `vroom::vroom()`. 

Review Comment:
   Is it worth mentioning that, for advanced usage, users can pass
   Arrow-specific options through the `parse_options`, `read_options`, and
   `convert_options` parameters (or at least acknowledging that users have
   access to the Arrow-specific options, even without explicitly naming these
   parameters)? (I'm anticipating that the answer may well be 'no'!)
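
   If we did decide to mention it, a minimal sketch of the kind of thing I mean
   (untested, and assuming `CsvReadOptions$create()` is the right constructor
   for these options):

   ```r
   library(arrow, warn.conflicts = FALSE)

   file_path <- tempfile()
   write_csv_arrow(mtcars, file_path)

   # block_size controls how much of the file the Arrow C++ CSV reader scans
   # at a time -- an Arrow-specific knob with no readr-style argument equivalent
   read_csv_arrow(
     file_path,
     read_options = CsvReadOptions$create(block_size = 1024L * 1024L)
   )
   ```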



##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file
+
+For writing data to single files, the `arrow` package provides the
+following functions, which can be used with both R data frames and 
+Arrow Tables:
+
+- `write_parquet()`: write a file in Parquet format
+- `write_feather()`: write a file in Arrow IPC format
+- `write_csv_arrow()`: write a file in CSV format
+
+All these functions can read and write files in the local filesystem or
+to cloud storage. For more on cloud storage support in `arrow`, see the
+[cloud storage article](./fs.html).
+
+The `arrow` package also supports reading and writing multi-file datasets,
+which enable analysis and processing of larger-than-memory data, and provide 
+the ability to partition data into smaller chunks without loading the full 
+data into memory. For more information on this topic, see the
+[dataset article](./dataset.html).
+
+## Parquet format
+
+[Apache Parquet](https://parquet.apache.org/) is a popular
+choice for storing analytics data; it is a binary format that is 
+optimized for reduced file sizes and fast read performance, especially 
+for column-based access patterns. The simplest way to read and write
+Parquet data using `arrow` is with the `read_parquet()` and 
+`write_parquet()` functions. To illustrate this, we'll write the 
+`starwars` data included in `dplyr` to a Parquet file, then read it 
+back in. First load the `arrow` and `dplyr` packages:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+Next we'll write the data frame to a Parquet file located at `file_path`:
+
+```{r}
+file_path <- tempfile()
+write_parquet(starwars, file_path)
+```
+
+The size of a Parquet file is typically much smaller than the corresponding
+CSV file would have been. This is in part due to the use of file compression:
+by default, Parquet files written with the `arrow` package use
+[Snappy compression](https://google.github.io/snappy/), but other options
+such as gzip are also supported. See `help("write_parquet", package = "arrow")`
+for more information.
+
+Having written the Parquet file, we now can read it with `read_parquet()`:
+
+```{r}
+read_parquet(file_path)
+```
+
+The default is to return a data frame or tibble. If we want an Arrow Table
+instead, we would set `as_data_frame = FALSE`:
+
+```{r}
+read_parquet(file_path, as_data_frame = FALSE)
+```
+
+One useful feature of Parquet files is that they store data column-wise, and
+contain metadata that allow file readers to skip to the relevant sections of
+the file. That means it is possible to load only a subset of the columns
+without reading the complete file. The `col_select` argument to
+`read_parquet()` supports this functionality:
+
+```{r}
+read_parquet(file_path, col_select = c("name", "height", "mass"))
+```
+
+R object attributes are preserved when writing data to Parquet or
+Arrow/Feather files and when reading those files back into R. This enables
+round-trip writing and reading of `sf::sf` objects, R data frames with
+`haven::labelled` columns, and data frames with other custom
+attributes. To learn more about how metadata are handled in `arrow`, see the
+[metadata article](./metadata.html).
+
+## Arrow/Feather format
+
+The Arrow file format was developed to provide binary columnar 
+serialization for data frames, to make reading and writing data frames 
+efficient, and to make sharing data across data analysis languages easy.
+This file format is sometimes referred to as Feather because it is an
+outgrowth of the original [Feather](https://github.com/wesm/feather) project 
+that has now been moved into the Arrow project itself. You can find the 
+detailed specification of version 2 of the Arrow format -- officially 
+referred to as [the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
+on the Arrow specification page. 
+
+The `write_feather()` function writes version 2 Arrow/Feather files by
+default, and supports multiple kinds of file compression. Basic use is shown
+below:
+
+```{r}
+file_path <- tempfile()
+write_feather(starwars, file_path)
+```
+
+The `read_feather()` function provides a familiar interface for reading
+Feather files:
+
+```{r}
+read_feather(file_path)
+```
+
+Like the Parquet reader, this reader supports reading only a subset of
+columns, and can produce Arrow Table output:
+
+```{r}
+read_feather(
+  file = file_path, 
+  col_select = c("name", "height", "mass"), 
+  as_data_frame = FALSE
+)
+```
+
+## CSV format
+
+The read/write capabilities of the `arrow` package also include support for 
+CSV and other text-delimited files. The `read_csv_arrow()`, `read_tsv_arrow()`,
+and `read_delim_arrow()` functions all use the Arrow C++ CSV reader to read 
+data files, where the Arrow C++ options have been mapped to arguments in a 
+way that mirrors the conventions used in `readr::read_delim()`, with a 
+`col_select` argument inspired by `vroom::vroom()`. 
+
+Although `read_csv_arrow()` currently has fewer parsing options than other
+CSV readers available in R for dealing with every CSV format variation in
+the wild, for those files that it can read it is often significantly faster
+than other R CSV readers, such as `base::read.csv`, `readr::read_csv`, and
+`data.table::fread`.
+
+A simple example of writing and reading a CSV file with `arrow` is shown below:
+
+```{r}
+file_path <- tempfile()
+write_csv_arrow(mtcars, file_path)
+read_csv_arrow(file_path, col_select = starts_with("d"))
+```
+
+## JSON format
+
+The `arrow` package supports reading (but not writing) tabular data from
+line-delimited JSON, using the `read_json_arrow()` function. A minimal
+example is shown below:
+
+```{r}
+file_path <- tempfile()
+writeLines('
+    { "hello": 3.5, "world": false, "yo": "thing" }
+    { "hello": 3.25, "world": null }
+    { "hello": 0.0, "world": true, "yo": null }
+  ', file_path, useBytes = TRUE)
+read_json_arrow(file_path)
+```
+
+## Further reading
+
+- To learn more about cloud storage, see the [cloud storage article](./fs.html).
+- To learn more about multi-file datasets, see the [datasets article](./dataset.html).

Review Comment:
   Perhaps link to cookbook chapters?
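
   Something like this for the list, perhaps (cookbook URL is from memory, so
   worth double-checking before merging):

   ```md
   - For worked examples, see the reading and writing data chapter of the
     [Arrow R cookbook](https://arrow.apache.org/cookbook/r/reading-and-writing-data.html).
   ```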



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
