ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612700647
##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.

 ## Installation

 Install the latest release of `arrow` from CRAN with

-```r
+``` r
 install.packages("arrow")
 ```

 Conda users can install `arrow` from conda-forge with

-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```

 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.

 ## Installing a development version

-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:

 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```

-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:

-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```

-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with

 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```

-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects

-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:

-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class | Description | How to create an instance |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field` | a character string name and a `DataType` | `field(name, type)` |
+| `Schema` | list of `Field`s | `schema(...)` |

-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:

-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class | Description | How to create an instance |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0 | `Scalar` | single value and its `DataType` | `Scalar$create(value, type)` |
+| 1 | `Array` | vector of values and its `DataType` | `Array$create(vector, type)` |
+| 1 | `ChunkedArray` | vectors of values and their `DataType` | `ChunkedArray$create(..., type)` |
+| 2 | `RecordBatch` | list of `Array`s with a `Schema` | `RecordBatch$create(...)` |
+| 2 | `Table` | list of `ChunkedArray` with a `Schema` | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2 | `Dataset` | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")` |

-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.

-Other flags that may be useful:
+## Reading and writing data files with Arrow

-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.

-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+- `read_parquet()`: read a file in Parquet format (an efficient
+  columnar data format)
+- `read_feather()`: read a file in Feather format (the Apache Arrow
+  IPC format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter
+  is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file

-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_parquet()` and `write_feather()`. These
+functions also accept R data frames.

-``` shell
-cd ../../r
+## Using dplyr with Arrow

-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+The `arrow` package provides a `dplyr` backend, enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To begin, load both `arrow` and
+`dplyr`:

-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```

-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then create an Arrow `Table` or `RecordBatch` using one of the object
+creation or file loading functions listed above. For example, create a
+`Table` named `sw` with the Star Wars characters data frame that’s
+included in `dplyr`:

-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+sw <- Table$create(starwars)
 ```

-If the package fails to install/load with an error like this:
+Or read the same data from a Parquet file, using `as_data_frame = FALSE`
+to create a `Table` named `sw`:

-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
-
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-    ./lint.sh
-
-Fix any style issues before committing with
-
-    ./lint.sh --fix
+``` r
+write_parquet(starwars, data_file <- tempfile()) # write file to demonstrate reading it
+sw <- read_parquet(data_file, as_data_frame = FALSE)
+```

-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+For larger or multi-file datasets, load the data into a `Dataset` as
+described in `vignette("dataset", package = "arrow")`.

-### Running tests
+Next, pipe on `dplyr` verbs:

-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>%
+  filter(homeworld == "Tatooine") %>%
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```

-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.

+The `arrow` package uses lazy evaluation. `result` is an object with

Review comment:
I think I'll use "until the result is required" instead of "until if that is requested" if that's cool with you