[GitHub] [arrow] nealrichardson commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

GitBox Wed, 14 Apr 2021 10:01:12 -0700


nealrichardson commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613420800




##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 
[![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 
[![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and

Review comment:
       we could link to our blog post on CSV reading

##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 
[![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 
[![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function

Review comment:
       I'm not sure "open" is meaningful here, nor single function call. I 
think the point is that you can treat a directory of many files as a single 
dataset. 
   
   For writing, the point is that you can split your data into multiple files 
in ways that improve the speed with which you can query.

##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 
[![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 
[![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability

Review comment:
       One could argue that Parquet is also fast and interoperable. Perhaps 
have a look at the Arrow website FAQ (and maybe link to it here with "When 
should I use Parquet vs. Feather?" or something?)

##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 
[![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 
[![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **full control over data types** of columns when reading
+    and writing data files
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Manipulate and analyze **larger-than-memory datasets** with
+    **`dplyr` verbs**
+-   Pass data between **R and Python** in the same process

Review comment:
       Might want to say "zero-copy" here somewhere

##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 
[![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 
[![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **full control over data types** of columns when reading
+    and writing data files
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Manipulate and analyze **larger-than-memory datasets** with

Review comment:
       I'd fold this into the earlier statement about datasets

##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 
[![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 
[![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **full control over data types** of columns when reading

Review comment:
       This one isn't clear to me

##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 
[![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 
[![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function

Review comment:
       This could also link to the dataset vignette.

##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 
[![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 
[![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **full control over data types** of columns when reading
+    and writing data files
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Manipulate and analyze **larger-than-memory datasets** with
+    **`dplyr` verbs**
+-   Pass data between **R and Python** in the same process
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version

Review comment:
       Maybe delete this heading?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Reply via email to