[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

GitBox Tue, 13 Apr 2021 11:56:39 -0700


ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612700647




##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_parquet()` and `write_feather()`. These
+functions also accept R data frames.
 
-``` shell
-cd ../../r
+## Using dplyr with Arrow
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+The `arrow` package provides a `dplyr` backend, enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To begin, load both `arrow` and
+`dplyr`:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then create an Arrow `Table` or `RecordBatch` using one of the object
+creation or file loading functions listed above. For example, create a
+`Table` named `sw` with the Star Wars characters data frame that’s
+included in `dplyr`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+sw <- Table$create(starwars)
 ```
 
-If the package fails to install/load with an error like this:
+Or read the same data from a Parquet file, using `as_data_frame = FALSE`
+to create a `Table` named `sw`:
 
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
-
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-    ./lint.sh
-
-Fix any style issues before committing with
-
-    ./lint.sh --fix
+``` r
+write_parquet(starwars, data_file <- tempfile()) # write file to demonstrate reading it
+sw <- read_parquet(data_file, as_data_frame = FALSE)
+```
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+For larger or multi-file datasets, load the data into a `Dataset` as
+described in `vignette("dataset", package = "arrow")`.
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation. `result` is an object with

Review comment:
       I think I'll use "until the result is required" instead of "until if that is requested" if that's cool with you
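
       For context, here is a minimal sketch of the lazy-evaluation behaviour that sentence describes. It is an illustration, not text from the PR: it assumes `arrow` and `dplyr` are installed, reuses the `sw` table from the README example, and the variable name `df` is my own.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Arrow Table built from the starwars data frame shipped with dplyr
sw <- Table$create(starwars)

# Piping dplyr verbs onto a Table only records the query;
# no rows are computed at this point
result <- sw %>%
  filter(homeworld == "Tatooine") %>%
  select(name, height, mass)

# collect() is the point where the result is required:
# the query runs and an R data frame comes back
df <- collect(result)
```

       Printing `result` before `collect()` shows a query object rather than data, which is why the wording about when evaluation happens matters.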




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
