[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

GitBox Tue, 13 Apr 2021 11:51:07 -0700


ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612697191




##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_parquet()` and `write_feather()`. These
+functions also accept R data frames.

Review comment:
       In R, it should be either "data frame" or `data.frame`—don't get me started on how it's totally different if we're talking about pandas or Spark...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
