djnavarro commented on code in PR #14514: URL: https://github.com/apache/arrow/pull/14514#discussion_r1028612461
########## r/vignettes/read_write.Rmd: ##########
@@ -0,0 +1,164 @@

---
title: "Reading and writing data files"
description: >
  Learn how to read and write CSV, Parquet, and Feather files with `arrow`
output: rmarkdown::html_vignette
---

The `arrow` package provides functions for reading single data files in
several common formats. By default, calling any of these functions
returns an R data frame. To return an Arrow Table instead, set the argument
`as_data_frame = FALSE`.

- `read_parquet()`: read a file in Parquet format
- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)
- `read_delim_arrow()`: read a delimited text file (the default delimiter is a comma)
- `read_csv_arrow()`: read a comma-separated values (CSV) file
- `read_tsv_arrow()`: read a tab-separated values (TSV) file
- `read_json_arrow()`: read a JSON data file

For writing data to single files, the `arrow` package provides the
following functions, which can be used with both R data frames and
Arrow Tables:

- `write_parquet()`: write a file in Parquet format
- `write_feather()`: write a file in Arrow IPC format
- `write_csv_arrow()`: write a file in CSV format

All of these functions can read and write files in the local filesystem or
in cloud storage. For more on cloud storage support in `arrow`, see the
[cloud storage article](./fs.html).

The `arrow` package also supports reading and writing multi-file datasets,
which enable analysis and processing of larger-than-memory data and provide
the ability to partition data into smaller chunks without loading the full
data into memory. For more information on this topic, see the
[dataset article](./dataset.html).

## Parquet format

[Apache Parquet](https://parquet.apache.org/) is a popular choice for
storing analytics data; it is a binary format that is optimized for
reduced file sizes and fast read performance, especially for
column-based access patterns. The simplest way to read and write
Parquet data using `arrow` is with the `read_parquet()` and
`write_parquet()` functions. To illustrate this, we'll write the
`starwars` data included in `dplyr` to a Parquet file, then read it
back in. First load the `arrow` and `dplyr` packages:

```{r}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
```

Next we'll write the data frame to a Parquet file located at `file_path`:

```{r}
file_path <- tempfile()
write_parquet(starwars, file_path)
```

A Parquet file is typically much smaller than the corresponding CSV file
would be. This is in part due to file compression: by default, Parquet
files written with the `arrow` package use
[Snappy compression](https://google.github.io/snappy/), but other options
such as gzip are also supported. See
`help("write_parquet", package = "arrow")` for more information.

Having written the Parquet file, we can now read it with `read_parquet()`:

```{r}
read_parquet(file_path)
```

The default is to return a data frame or tibble. If we want an Arrow Table
instead, we set `as_data_frame = FALSE`:

```{r}
read_parquet(file_path, as_data_frame = FALSE)
```
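If you later need the data back in R, an Arrow Table can be converted to a data frame in a separate step. The code below is a minimal sketch, assuming only the `as.data.frame()` method that `arrow` provides for Tables; `sw_table` is just an illustrative name for the intermediate object:

```{r}
# Read the Parquet file as an Arrow Table, keeping the data in
# Arrow memory, then convert the Table to an R data frame
sw_table <- read_parquet(file_path, as_data_frame = FALSE)
sw_df <- as.data.frame(sw_table)
head(sw_df)
```

This two-step pattern can be convenient when you want to hold data in Arrow memory and only materialize an R copy at the end; the same conversion is also available via `dplyr::collect()`.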
One useful feature of Parquet files is that they store data column-wise and
contain metadata that allow file readers to skip to the relevant sections of
the file. This means it is possible to load a subset of the columns without
reading the complete file. The `col_select` argument to `read_parquet()`
supports this functionality:

```{r}
read_parquet(file_path, col_select = c("name", "height", "mass"))
```

R object attributes are preserved when writing data to Parquet or
Arrow/Feather files and when reading those files back into R. This enables
round-trip writing and reading of `sf::sf` objects, R data frames with
`haven::labelled` columns, and data frames with other custom attributes. To
learn more about how metadata are handled in `arrow`, see the
[metadata article](./metadata.html).

## Arrow/Feather format

The Arrow file format was developed to provide binary columnar
serialization for data frames, to make reading and writing data frames
efficient, and to make sharing data across data analysis languages easy.
This file format is sometimes referred to as Feather because it is an
outgrowth of the original [Feather](https://github.com/wesm/feather)
project, which has since been moved into the Arrow project itself. You can
find the detailed specification of version 2 of the Arrow format --
officially referred to as
[the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
on the Arrow specification page.

The `write_feather()` function writes version 2 Arrow/Feather files by
default, and supports multiple kinds of file compression. Basic use is
shown below:

```{r}
file_path <- tempfile()
write_feather(starwars, file_path)
```

The `read_feather()` function provides a familiar interface for reading
Feather files:

```{r}
read_feather(file_path)
```

Like the Parquet reader, this reader supports reading only a subset of
columns, and can produce Arrow Table output:

```{r}
read_feather(
  file = file_path,
  col_select = c("name", "height", "mass"),
  as_data_frame = FALSE
)
```

## CSV format

The read/write capabilities of the `arrow` package also include support for
CSV and other text-delimited files. The `read_csv_arrow()`,
`read_tsv_arrow()`, and `read_delim_arrow()` functions all use the Arrow
C++ CSV reader to read data files, with the Arrow C++ options mapped to
arguments in a way that mirrors the conventions used in
`readr::read_delim()`, and with a `col_select` argument inspired by
`vroom::vroom()`.

Although `read_csv_arrow()` currently has fewer options for parsing every
CSV variation in the wild than some other CSV readers available in R, for
those files it can read it is often significantly faster than other R CSV
readers such as `base::read.csv`, `readr::read_csv`, and
`data.table::fread`.

Review Comment:
   I would prefer not to. It's another one where this copy is preserved from the existing docs (it's currently here: https://arrow.apache.org/docs/r/articles/arrow.html), and I personally think it's unwise. I don't think it helps the community to make benchmarking claims in the documentation. That feels like something to do elsewhere? I'd be very happy to delete this, actually.
