djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1016082278


##########
r/README.md:
##########
@@ -1,331 +1,104 @@
-# arrow
+# arrow <img src="https://arrow.apache.org/img/arrow-logo_hex_black-txt_white-bg.png" align="right" alt="" width="120" />
 
 
[![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
 
[![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 
[![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-**[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data.** It specifies a standardized
+[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data. It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-**The `arrow` package exposes an interface to the Arrow C++ library,
-enabling access to many of its features in R.** It provides low-level
+The `arrow` R package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R. It provides low-level
 access to the Arrow C++ library API and higher-level access through a
 `{dplyr}` backend and familiar R functions.
 
 ## What can the `arrow` package do?
 
--   Read and write **Parquet files** (`read_parquet()`,
-    `write_parquet()`), an efficient and widely used columnar format
--   Read and write **Feather files** (`read_feather()`,
-    `write_feather()`), a format optimized for speed and
-    interoperability
--   Analyze, process, and write **multi-file, larger-than-memory
-    datasets** (`open_dataset()`, `write_dataset()`)
--   Read **large CSV and JSON files** with excellent **speed and
-    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
--   Write CSV files (`write_csv_arrow()`)
--   Manipulate and analyze Arrow data with **`dplyr` verbs**
--   Read and write files in **Amazon S3** and **Google Cloud Storage**
-    buckets with no additional function calls
--   Exercise **fine control over column types** for seamless
-    interoperability with databases and data warehouse systems
--   Use **compression codecs** including Snappy, gzip, Brotli,
-    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
--   Enable **zero-copy data sharing** between **R and Python**
--   Connect to **Arrow Flight** RPC servers to send and receive large
-    datasets over networks
--   Access and manipulate Arrow objects through **low-level bindings**
-    to the C++ library
--   Provide a **toolkit for building connectors** to other applications
-    and services that use Arrow
-
-## Installation
+The `arrow` package provides functionality for a wide range of data analysis
+tasks. It allows users to read and write data in a variety of formats:
 
-### Installing the latest release version
-
-Install the latest release of `arrow` from CRAN with
-
-``` r
-install.packages("arrow")
-```
+-   Read and write Parquet files, an efficient and widely used columnar format
+-   Read and write Feather files, a format optimized for speed and
+    interoperability
+-   Read and write CSV files with excellent speed and efficiency
+-   Read and write multi-file larger-than-memory datasets
+-   Read JSON files
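The paired reader/writer functions behind this list appear later in the diff; as a quick illustrative sketch (assuming the `arrow` package is installed with Parquet support, which is the default in CRAN builds):

``` r
library(arrow)

# Round-trip a built-in R data frame through Parquet.
# tempfile() is used here only as an example path.
path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)
df <- read_parquet(path)
```

By default `read_parquet()` returns an R `data.frame`; passing `as_data_frame = FALSE` returns an Arrow `Table` instead.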
 
-Conda users can install `arrow` from conda-forge with
+It provides data analysis tools for both in-memory and larger-than-memory datasets:
 
-``` shell
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+-   Analyze and process larger-than-memory datasets
+-   Manipulate and analyze Arrow data with `dplyr` verbs
 
-Installing a released version of the `arrow` package requires no
-additional system dependencies. For macOS and Windows, CRAN hosts binary
-packages that contain the Arrow C++ library. On Linux, source package
-installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable
-`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for details.
+It provides access to remote filesystems and servers:
 
-As of version 10.0.0, `arrow` requires C++17 to build. This means that:
+-   Read and write files in Amazon S3 and Google Cloud Storage buckets
+-   Connect to Arrow Flight servers to transport large datasets over networks  
+    
+Additional features include:
 
-* On Windows, you need `R >= 4.0`. Version 9.0.0 was the last version to support
-R 3.6.
-* On CentOS 7, you can build the latest version of `arrow`,
-but you first need to install a newer compiler than the default system compiler,
-gcc 4.8. See `vignette("install", package = "arrow")` for guidance.
-Note that you only need the newer compiler to build `arrow`:
-installing a binary package, as from RStudio Package Manager,
-or loading a package you've already installed works fine with the system defaults.
+-   Zero-copy data sharing between R and Python
+-   Fine control over column types to work seamlessly
+    with databases and data warehouses
+-   Support for compression codecs including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2
+-   Access and manipulate Arrow objects through low-level bindings
+    to the C++ library
+-   Toolkit for building connectors to other applications
+    and services that use Arrow
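Several of these features surface as simple function arguments. For example, the compression codecs are exposed via the `compression` argument of the writer functions; a minimal sketch (codec availability depends on how the underlying Arrow C++ library was built):

``` r
library(arrow)

# Write the same data with an explicit codec choice.
# "zstd" is one of the codecs listed above; swap in "snappy",
# "gzip", etc. if your build includes them.
path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path, compression = "zstd")
```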
 
-### Installing a development version
+## Installation
 
-Development versions of the package (binary and source) are built
-nightly and hosted at <https://nightlies.apache.org/arrow/r/>. To
-install from there:
+Most R users will probably want to install the latest release of `arrow` 
+from CRAN:
 
 ``` r
-install.packages("arrow", repos = c(arrow = "https://nightlies.apache.org/arrow/r", getOption("repos")))
+install.packages("arrow")
 ```
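After installing, one way to confirm the package loads and to see which optional capabilities were compiled in is the package's `arrow_info()` helper (the exact output varies by build and platform):

``` r
library(arrow)

# Reports the package and C++ library versions plus which optional
# features (compression codecs, S3/GCS support, etc.) are enabled.
arrow_info()
```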
 
-Conda users can install `arrow` nightly builds with
+Alternatively, if you are using conda, you can install `arrow` from conda-forge:
 
 ``` shell
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-If you already have a version of `arrow` installed, you can switch to
-the latest nightly development version with
-
-``` r
-arrow::install_arrow(nightly = TRUE)
-```
-
-These nightly package builds are not official Apache releases and are
-not recommended for production use. They may be useful for testing bug
-fixes and new features under active development.
-
-## Usage
-
-Among the many applications of the `arrow` package, two of the most accessible are:
-
--   High-performance reading and writing of data files with multiple
-    file formats and compression codecs, including built-in support for
-    cloud storage
--   Analyzing and manipulating bigger-than-memory data with `dplyr`
-    verbs
-
-The sections below describe these two uses and illustrate them with
-basic examples. The sections below mention two Arrow data structures:
-
--   `Table`: a tabular, column-oriented data structure capable of
-    storing and processing large amounts of data more efficiently than
-    R’s built-in `data.frame` and with SQL-like column data types that
-    afford better interoperability with databases and data warehouse
-    systems
--   `Dataset`: a data structure functionally similar to `Table` but with
-    the capability to work on larger-than-memory data partitioned across
-    multiple files
-
-### Reading and writing data files with `arrow`
-
-The `arrow` package provides functions for reading single data files in
-several common formats. By default, calling any of these functions
-returns an R `data.frame`. To return an Arrow `Table`, set argument
-`as_data_frame = FALSE`.
-
--   `read_parquet()`: read a file in Parquet format
--   `read_feather()`: read a file in Feather format (the Apache Arrow
-    IPC format)
--   `read_delim_arrow()`: read a delimited text file (default delimiter
-    is comma)
--   `read_csv_arrow()`: read a comma-separated values (CSV) file
--   `read_tsv_arrow()`: read a tab-separated values (TSV) file
--   `read_json_arrow()`: read a JSON data file
-
-For writing data to single files, the `arrow` package provides the
-functions `write_parquet()`, `write_feather()`, and `write_csv_arrow()`.
-These can be used with R `data.frame` and Arrow `Table` objects.
-
-For example, let’s write the Star Wars characters data that’s included
-in `dplyr` to a Parquet file, then read it back in. Parquet is a popular
-choice for storing analytic data; it is optimized for reduced file sizes
-and fast read performance, especially for column-based access patterns.
-Parquet is widely supported by many tools and platforms.
-
-First load the `arrow` and `dplyr` packages:
-
-``` r
-library(arrow, warn.conflicts = FALSE)
-library(dplyr, warn.conflicts = FALSE)
-```
-
-Then write the `data.frame` named `starwars` to a Parquet file at
-`file_path`:
-
-``` r
-file_path <- tempfile()
-write_parquet(starwars, file_path)
-```
-
-Then read the Parquet file into an R `data.frame` named `sw`:
-
-``` r
-sw <- read_parquet(file_path)
-```
-
-R object attributes are preserved when writing data to Parquet or
-Feather files and when reading those files back into R. This enables
-round-trip writing and reading of `sf::sf` objects, R `data.frame`s
-with `haven::labelled` columns, and `data.frame`s with other custom
-attributes.
-
-For reading and writing larger files or sets of multiple files, `arrow`
-defines `Dataset` objects and provides the functions `open_dataset()`
-and `write_dataset()`, which enable analysis and processing of
-bigger-than-memory data, including the ability to partition data into
-smaller chunks without loading the full data into memory. For examples
-of these functions, see `vignette("dataset", package = "arrow")`.
-
-All these functions can read and write files in the local filesystem or
-in Amazon S3 (by passing S3 URIs beginning with `s3://`). For more
-details, see `vignette("fs", package = "arrow")`.
-
-### Using `dplyr` with `arrow`
-
-The `arrow` package provides a `dplyr` backend enabling manipulation of
-Arrow tabular data with `dplyr` verbs. To use it, first load both
-packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
-`Dataset` object. For example, read the Parquet file written in the
-previous example into an Arrow `Table` named `sw`:
-
-``` r
-sw <- read_parquet(file_path, as_data_frame = FALSE)
-```
-
-Next, pipe on `dplyr` verbs:
-
-``` r
-result <- sw %>%
-  filter(homeworld == "Tatooine") %>%
-  rename(height_cm = height, mass_kg = mass) %>%
-  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
-  arrange(desc(birth_year)) %>%
-  select(name, height_in, mass_lbs)
-```
-
-The `arrow` package uses lazy evaluation to delay computation until the
-result is required. This speeds up processing by enabling the Arrow C++
-library to perform multiple computations in one operation. `result` is
-an object with class `arrow_dplyr_query` which represents all the
-computations to be performed:
-
-``` r
-result
-#> Table (query)
-#> name: string
-#> height_in: expr
-#> mass_lbs: expr
-#>
-#> * Filter: equal(homeworld, "Tatooine")
-#> * Sorted by birth_year [desc]
-#> See $.data for the source Arrow object
-```
-
-To perform these computations and materialize the result, call
-`compute()` or `collect()`. `compute()` returns an Arrow `Table`,
-suitable for passing to other `arrow` or `dplyr` functions:
-
-``` r
-result %>% compute()
-#> Table
-#> 10 rows x 3 columns
-#> $name <string>
-#> $height_in <double>
-#> $mass_lbs <double>
-```
-
-`collect()` returns an R `data.frame`, suitable for viewing or passing
-to other R functions for analysis or visualization:
-
-``` r
-result %>% collect()
-#> # A tibble: 10 x 3
-#>    name               height_in mass_lbs
-#>    <chr>                  <dbl>    <dbl>
-#>  1 C-3PO                   65.7    165.
-#>  2 Cliegg Lars             72.0     NA
-#>  3 Shmi Skywalker          64.2     NA
-#>  4 Owen Lars               70.1    265.
-#>  5 Beru Whitesun lars      65.0    165.
-#>  6 Darth Vader             79.5    300.
-#>  7 Anakin Skywalker        74.0    185.
-#>  8 Biggs Darklighter       72.0    185.
-#>  9 Luke Skywalker          67.7    170.
-#> 10 R5-D4                   38.2     70.5
+conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
-The `arrow` package works with most single-table `dplyr` verbs, including those
-that compute aggregates.
+In most cases installing the latest release should "just work" without 

Review Comment:
   I decided to simplify it and say "work" without quotes or the "just".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to