djnavarro commented on code in PR #14514: URL: https://github.com/apache/arrow/pull/14514#discussion_r1028612461
########## r/vignettes/read_write.Rmd: ##########
@@ -0,0 +1,164 @@

---
title: "Reading and writing data files"
description: >
  Learn how to read and write CSV, Parquet, and Feather files with `arrow`
output: rmarkdown::html_vignette
---

The `arrow` package provides functions for reading single data files in
several common formats. By default, calling any of these functions
returns an R data frame. To return an Arrow Table instead, set the argument
`as_data_frame = FALSE`.

- `read_parquet()`: read a file in Parquet format
- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)
- `read_delim_arrow()`: read a delimited text file (the default delimiter is a comma)
- `read_csv_arrow()`: read a comma-separated values (CSV) file
- `read_tsv_arrow()`: read a tab-separated values (TSV) file
- `read_json_arrow()`: read a JSON data file

For writing data to single files, the `arrow` package provides the
following functions, which can be used with both R data frames and
Arrow Tables:

- `write_parquet()`: write a file in Parquet format
- `write_feather()`: write a file in Arrow IPC format
- `write_csv_arrow()`: write a file in CSV format

All of these functions can read and write files in the local filesystem or
in cloud storage. For more on cloud storage support in `arrow`, see the
[cloud storage article](./fs.html).

The `arrow` package also supports reading and writing multi-file datasets,
which enable analysis and processing of larger-than-memory data and provide
the ability to partition data into smaller chunks without loading the full
data into memory. For more information on this topic, see the
[dataset article](./dataset.html).

## Parquet format

[Apache Parquet](https://parquet.apache.org/) is a popular choice for
storing analytics data; it is a binary format that is optimized for
reduced file sizes and fast read performance, especially for
column-based access patterns. The simplest way to read and write
Parquet data using `arrow` is with the `read_parquet()` and
`write_parquet()` functions. To illustrate this, we'll write the
`starwars` data included in `dplyr` to a Parquet file, then read it
back in. First load the `arrow` and `dplyr` packages:

```{r}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
```

Next we'll write the data frame to a Parquet file located at `file_path`:

```{r}
file_path <- tempfile()
write_parquet(starwars, file_path)
```

A Parquet file is typically much smaller than the corresponding CSV file
would be. This is in part due to file compression: by default, Parquet
files written with the `arrow` package use
[Snappy compression](https://google.github.io/snappy/), but other options
such as gzip are also supported. See
`help("write_parquet", package = "arrow")` for more information.

Having written the Parquet file, we can now read it with `read_parquet()`:

```{r}
read_parquet(file_path)
```

The default is to return a data frame or tibble. If we want an Arrow Table
instead, we set `as_data_frame = FALSE`:

```{r}
read_parquet(file_path, as_data_frame = FALSE)
```
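If you later need the data back in R, an Arrow Table can be converted to a data frame in a separate step. The code below is a minimal sketch, assuming only the `as.data.frame()` method that `arrow` provides for Tables; `sw_table` is just an illustrative name for the intermediate object:

```{r}
# Read the Parquet file as an Arrow Table, keeping the data in
# Arrow memory, then convert the Table to an R data frame
sw_table <- read_parquet(file_path, as_data_frame = FALSE)
sw_df <- as.data.frame(sw_table)
head(sw_df)
```

This two-step pattern can be convenient when you want to hold data in Arrow memory and only materialize an R copy at the end; the same conversion is also available via `dplyr::collect()`.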
One useful feature of Parquet files is that they store data column-wise and
contain metadata that allow file readers to skip to the relevant sections of
the file. This means it is possible to load a subset of the columns without
reading the complete file. The `col_select` argument to `read_parquet()`
supports this functionality:

```{r}
read_parquet(file_path, col_select = c("name", "height", "mass"))
```

R object attributes are preserved when writing data to Parquet or
Arrow/Feather files and when reading those files back into R. This enables
round-trip writing and reading of `sf::sf` objects, R data frames with
`haven::labelled` columns, and data frames with other custom attributes. To
learn more about how metadata are handled in `arrow`, see the
[metadata article](./metadata.html).

## Arrow/Feather format

The Arrow file format was developed to provide binary columnar
serialization for data frames, to make reading and writing data frames
efficient, and to make sharing data across data analysis languages easy.
This file format is sometimes referred to as Feather because it is an
outgrowth of the original [Feather](https://github.com/wesm/feather)
project, which has since been moved into the Arrow project itself. You can
find the detailed specification of version 2 of the Arrow format --
officially referred to as
[the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
on the Arrow specification page.

The `write_feather()` function writes version 2 Arrow/Feather files by
default, and supports multiple kinds of file compression. Basic use is
shown below:

```{r}
file_path <- tempfile()
write_feather(starwars, file_path)
```

The `read_feather()` function provides a familiar interface for reading
Feather files:

```{r}
read_feather(file_path)
```

Like the Parquet reader, this reader supports reading only a subset of
columns, and can produce Arrow Table output:

```{r}
read_feather(
  file = file_path,
  col_select = c("name", "height", "mass"),
  as_data_frame = FALSE
)
```

## CSV format

The read/write capabilities of the `arrow` package also include support for
CSV and other text-delimited files. The `read_csv_arrow()`,
`read_tsv_arrow()`, and `read_delim_arrow()` functions all use the Arrow
C++ CSV reader to read data files, with the Arrow C++ options mapped to
arguments in a way that mirrors the conventions used in
`readr::read_delim()`, and with a `col_select` argument inspired by
`vroom::vroom()`.

Although `read_csv_arrow()` currently has fewer options for parsing every
CSV variation in the wild than some other CSV readers available in R, for
those files it can read it is often significantly faster than other R CSV
readers such as `base::read.csv`, `readr::read_csv`, and
`data.table::fread`.

Review Comment:
   I would prefer not to. It's another one where this copy is preserved from the existing docs (it's currently here: https://arrow.apache.org/docs/r/articles/arrow.html), and I personally think it's unwise. I don't think it helps the community to make benchmarking claims in the documentation. That feels like something to do elsewhere? I'd be very happy to delete this, actually.
