ianmcook commented on a change in pull request #10014: URL: https://github.com/apache/arrow/pull/10014#discussion_r613496829
########## File path: r/README.md ########## @@ -4,250 +4,283 @@ [](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush) [](https://anaconda.org/conda-forge/r-arrow) -[Apache Arrow](https://arrow.apache.org/) is a cross-language -development platform for in-memory data. It specifies a standardized +**[Apache Arrow](https://arrow.apache.org/) is a cross-language +development platform for in-memory data.** It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. -The `arrow` package exposes an interface to the Arrow C++ library to -access many of its features in R. This includes support for analyzing -large, multi-file datasets (`open_dataset()`), working with individual -Parquet (`read_parquet()`, `write_parquet()`) and Feather -(`read_feather()`, `write_feather()`) files, as well as lower-level -access to Arrow memory and messages. +**The `arrow` package exposes an interface to the Arrow C++ library, +enabling access to many of its features in R.** It provides low-level +access to the Arrow C++ library API and higher-level access through a +`dplyr` backend and familiar R functions. + +## What can the `arrow` package do? 
+ +- Read and write **Parquet files** (`read_parquet()`, + `write_parquet()`), an efficient and widely used columnar format +- Read and write **Feather files** (`read_feather()`, + `write_feather()`), a format optimized for speed and + interoperability +- Open or write **large, multi-file datasets** with a single function + call (`open_dataset()`, `write_dataset()`) +- Read **large CSV and JSON files** with excellent **speed and + efficiency** (`read_csv_arrow()`, `read_json_arrow()`) +- Read and write files in **Amazon S3** buckets with no additional + function calls +- Exercise **full control over data types** of columns when reading + and writing data files +- Use **compression codecs** including Snappy, gzip, Brotli, + Zstandard, LZ4, LZO, and bzip2 for reading and writing data +- Manipulate and analyze **larger-than-memory datasets** with + **`dplyr` verbs** +- Pass data between **R and Python** in the same process +- Connect to **Arrow Flight** RPC servers to send and receive large + datasets over networks +- Access and manipulate Arrow objects through **low-level bindings** + to the C++ library +- Provide a **toolkit for building connectors** to other applications + and services that use Arrow ## Installation Install the latest release of `arrow` from CRAN with -```r +``` r install.packages("arrow") ``` Conda users can install `arrow` from conda-forge with -``` +``` shell conda install -c conda-forge --strict-channel-priority r-arrow ``` Installing a released version of the `arrow` package requires no additional system dependencies. For macOS and Windows, CRAN hosts binary packages that contain the Arrow C++ library. On Linux, source package installation will also build necessary C++ dependencies. For a faster, -more complete installation, set the environment variable `NOT_CRAN=true`. -See `vignette("install", package = "arrow")` for details. +more complete installation, set the environment variable +`NOT_CRAN=true`. 
See `vignette("install", package = "arrow")` for +details. ## Installing a development version -Development versions of the package (binary and source) are built daily and hosted at -<https://arrow-r-nightly.s3.amazonaws.com>. To install from there: +Development versions of the package (binary and source) are built +nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To +install from there: ``` r install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com") ``` -Or - -```r -arrow::install_arrow(nightly = TRUE) -``` +Conda users can install `arrow` nightly builds with -Conda users can install `arrow` nightlies from our nightlies channel using: - -``` +``` shell conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow ``` -These daily package builds are not official Apache releases and are not -recommended for production use. They may be useful for testing bug fixes -and new features under active development. - -## Developing - -Windows and macOS users who wish to contribute to the R package and -don’t need to alter the Arrow C++ library may be able to obtain a -recent version of the library without building from source. On macOS, -you may install the C++ library using [Homebrew](https://brew.sh/): +If you already have a version of `arrow` installed, you can switch to +the latest nightly development version with -``` shell -# For the released version: -brew install apache-arrow -# Or for a development version, you can try: -brew install apache-arrow --HEAD +``` r +arrow::install_arrow(nightly = TRUE) ``` -On Windows, you can download a .zip file with the arrow dependencies from the -[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/), -and then set the `RWINLIB_LOCAL` environment variable to point to that -zip file before installing the `arrow` R package. Version numbers in that -repository correspond to dates, and you will likely want the most recent. 
- -If you need to alter both the Arrow C++ library and the R package code, -or if you can’t get a binary version of the latest C++ library -elsewhere, you’ll need to build it from source too. - -First, install the C++ library. See the [developer -guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details. -It's recommended to make a `build` directory inside of the `cpp` directory of -the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`, -you'll first call `cmake` to configure the build and then `make install`. -For the R package, you'll need to enable several features in the C++ library -using `-D` flags: +These nightly package builds are not official Apache releases and are +not recommended for production use. They may be useful for testing bug +fixes and new features under active development. -``` -cmake \ - -DARROW_COMPUTE=ON \ - -DARROW_CSV=ON \ - -DARROW_DATASET=ON \ - -DARROW_FILESYSTEM=ON \ - -DARROW_JEMALLOC=ON \ - -DARROW_JSON=ON \ - -DARROW_PARQUET=ON \ - -DCMAKE_BUILD_TYPE=release \ - -DARROW_INSTALL_NAME_RPATH=OFF \ - .. -``` +## Usage -where `..` is the path to the `cpp/` directory when you're in `cpp/build`. +Among the many uses of the `arrow` package, two of the most accessible +uses are: -To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags: +- Reading and writing data files +- Manipulating Arrow data with `dplyr` verbs -``` - -DARROW_S3=ON \ - -DARROW_MIMALLOC=ON \ - -DARROW_WITH_BROTLI=ON \ - -DARROW_WITH_BZ2=ON \ - -DARROW_WITH_LZ4=ON \ - -DARROW_WITH_SNAPPY=ON \ - -DARROW_WITH_ZLIB=ON \ - -DARROW_WITH_ZSTD=ON \ -``` +The sections below describe these two uses and illustrate them with +basic examples. 
The sections below mention two Arrow data structures: -Other flags that may be useful: +- `Table`: a tabular, column-oriented data structure capable of + storing and processing large amounts of data more efficiently than + R’s built-in `data.frame` and with SQL-like column data types that + afford better interoperability with databases and data warehouse + systems +- `Dataset`: a data structure functionally similar to `Table` but with + the capability to work on larger-than-memory data partitioned across + multiple files -* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers -* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source. +### Reading and writing data files with `arrow` -Note that after any change to the C++ library, you must reinstall it and -run `make clean` or `git clean -fdx .` to remove any cached object code -in the `r/src/` directory before reinstalling the R package. This is -only necessary if you make changes to the C++ library source; you do not -need to manually purge object files if you are only editing R or C++ -code inside `r/`. +The `arrow` package provides functions for reading single data files in +several common formats. By default, calling any of these functions +returns an R `data.frame`. To return an Arrow `Table`, set argument +`as_data_frame = FALSE`. 
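As a minimal sketch of the `as_data_frame = FALSE` behavior described above (assuming the `arrow` package is installed; the temporary CSV file is created only for the example):

``` r
library(arrow, warn.conflicts = FALSE)

# Write a small CSV to a temporary path, then read it back two ways
tf <- tempfile(fileext = ".csv")
write.csv(mtcars, tf, row.names = FALSE)

df  <- read_csv_arrow(tf)                         # default: an R data.frame
tbl <- read_csv_arrow(tf, as_data_frame = FALSE)  # an Arrow Table instead
```

The same `as_data_frame` argument applies to the other `read_*_arrow()` functions listed below.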
-Once you’ve built the C++ library, you can install the R package and its -dependencies, along with additional dev dependencies, from the git -checkout: +- `read_parquet()`: read a file in Parquet format +- `read_feather()`: read a file in Feather format (the Apache Arrow + IPC format) +- `read_delim_arrow()`: read a delimited text file (default delimiter + is comma) +- `read_csv_arrow()`: read a comma-separated values (CSV) file +- `read_tsv_arrow()`: read a tab-separated values (TSV) file +- `read_json_arrow()`: read a JSON data file -``` shell -cd ../../r +For writing data to single files, the `arrow` package provides the +functions `write_parquet()` and `write_feather()`. These can be used +with R `data.frame` and Arrow `Table` objects. -Rscript -e ' -options(repos = "https://cloud.r-project.org/") -if (!require("remotes")) install.packages("remotes") -remotes::install_deps(dependencies = TRUE) -' +For example, let’s write the Star Wars characters data that’s included +in `dplyr` to a Parquet file, then read it back in. First load the +`arrow` and `dplyr` packages: -R CMD INSTALL . +``` r +library(arrow, warn.conflicts = FALSE) +library(dplyr, warn.conflicts = FALSE) ``` -If you need to set any compilation flags while building the C++ -extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. 
For -example, if you are using `perf` to profile the R extensions, you may -need to set +Then write the `data.frame` named `starwars` to a Parquet file at +`file_path`: -``` shell -export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer +``` r +file_path <- tempfile() +write_parquet(starwars, file_path) ``` -If the package fails to install/load with an error like this: - - ** testing if installed package can be loaded from temporary location - Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...): - unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so': - dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib - -ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on -macOS to prevent problems at link time and is a no-op on other platforms). -Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to -wherever Arrow C++ was put in `make install`, e.g. `export -R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package. +Then read the Parquet file into an R `data.frame` named `sw`: -When installing from source, if the R and C++ library versions do not -match, installation may fail. If you’ve previously installed the -libraries and want to upgrade the R package, you’ll need to update the -Arrow C++ library first. - -For any other build/configuration challenges, see the [C++ developer -guide](https://arrow.apache.org/docs/developers/cpp/building.html) and -`vignette("install", package = "arrow")`. - -### Editing C++ code - -The `arrow` package uses some customized tools on top of `cpp11` to -prepare its C++ code in `src/`. If you change C++ code in the R package, -you will need to set the `ARROW_R_DEV` environment variable to `TRUE` -(optionally, add it to your`~/.Renviron` file to persist across -sessions) so that the `data-raw/codegen.R` file is used for code -generation. 
+``` r +sw <- read_parquet(file_path) +``` -We use Google C++ style in our C++ code. Check for style errors with +For reading and writing larger files or multiple files with Arrow +`Dataset` objects, `arrow` provides the functions `open_dataset()` and +`write_dataset()`. For examples of these, see +`vignette("dataset", package = "arrow")`. - ./lint.sh +All these functions can read and write files in the local filesystem (by +passing unqualified paths or `file://` URIs) or files in Amazon S3 (by +passing S3 URIs beginning with `s3://`). For more details, see +`vignette("fs", package = "arrow")` -Fix any style issues before committing with +### Using `dplyr` with `arrow` - ./lint.sh --fix +The `arrow` package provides a `dplyr` backend enabling manipulation of +Arrow tabular data with `dplyr` verbs. To use it, first load both +packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or +`Dataset` object. For example, read the Parquet file written in the +previous example into an Arrow `Table` named `sw`: -The lint script requires Python 3 and `clang-format-8`. If the command -isn’t found, you can explicitly provide the path to it like -`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get -this by installing LLVM via Homebrew and running the script as -`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh` +``` r +sw <- read_parquet(file_path, as_data_frame = FALSE) +``` -### Running tests +Next, pipe on `dplyr` verbs: -Some tests are conditionally enabled based on the availability of certain -features in the package build (S3 support, compression libraries, etc.). 
-Others are generally skipped by default but can be enabled with environment -variables or other settings: +``` r +result <- sw %>% + filter(homeworld == "Tatooine") %>% + rename(height_cm = height, mass_kg = mass) %>% + mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>% + arrange(desc(birth_year)) %>% + select(name, height_in, mass_lbs) +``` -* All tests are skipped on Linux if the package builds without the C++ libarrow. - To make the build fail if libarrow is not available (as in, to test that - the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE` -* Some tests are disabled unless `ARROW_R_DEV=TRUE` -* Tests that require allocating >2GB of memory to test Large types are disabled - unless `ARROW_LARGE_MEMORY_TESTS=TRUE` -* Integration tests against a real S3 bucket are disabled unless credentials - are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available - on request -* S3 tests using [MinIO](https://min.io/) locally are enabled if the - `minio server` process is found running. If you're running MinIO with custom - settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and - `MINIO_PORT` to override the defaults. +The `arrow` package uses lazy evaluation to delay computation until the +result is required. `result` is an object with class `arrow_dplyr_query` +which represents the computations to be performed: -### Useful functions +``` r +result +#> Table (query) +#> name: string +#> height_in: expr +#> mass_lbs: expr +#> +#> * Filter: equal(homeworld, "Tatooine") +#> * Sorted by birth_year [desc] +#> See $.data for the source Arrow object +``` -Within an R session, these can help with package development: +To execute these computations and materialize the result, call +`compute()` or `collect()`. 
`compute()` returns an Arrow `Table`, +suitable for passing to other `arrow` or `dplyr` functions: ``` r -devtools::load_all() # Load the dev package -devtools::test(filter="^regexp$") # Run the test suite, optionally filtering file names -devtools::document() # Update roxygen documentation -pkgdown::build_site() # To preview the documentation website -devtools::check() # All package checks; see also below -covr::package_coverage() # See test coverage statistics +result %>% compute() +#> Table +#> 10 rows x 3 columns +#> $name <string> +#> $height_in <double> +#> $mass_lbs <double> ``` -Any of those can be run from the command line by wrapping them in `R -e -'$COMMAND'`. There’s also a `Makefile` to help with some common tasks -from the command line (`make test`, `make doc`, `make clean`, etc.) - -### Full package validation +`collect()` returns an R `data.frame`, suitable for viewing or passing +to other R functions for analysis or visualization: -``` shell -R CMD build . -R CMD check arrow_*.tar.gz --as-cran +``` r +result %>% collect() +#> # A tibble: 10 x 3 +#> name height_in mass_lbs +#> <chr> <dbl> <dbl> +#> 1 C-3PO 65.7 165. +#> 2 Cliegg Lars 72.0 NA +#> 3 Shmi Skywalker 64.2 NA +#> 4 Owen Lars 70.1 265. +#> 5 Beru Whitesun lars 65.0 165. +#> 6 Darth Vader 79.5 300. +#> 7 Anakin Skywalker 74.0 185. +#> 8 Biggs Darklighter 72.0 185. +#> 9 Luke Skywalker 67.7 170. +#> 10 R5-D4 38.2 70.5 ``` + +The `arrow` package works with most `dplyr` verbs except those that +compute aggregates (such as `summarise()`, and `mutate()` after +`group_by()`). Inside `dplyr` verbs, Arrow offers limited support for +functions and operators, with broader support expected in upcoming +releases. For more information about available compute functions, see +`help("list_compute_functions")`. 
+ +For `dplyr` queries on `Table` objects, if the `arrow` package detects +an unimplemented function within a `dplyr` verb, it automatically calls +`collect()` to return the data as an R `data.frame` before processing +that `dplyr` verb. For queries on `Dataset` objects (which can be larger +than memory), it raises an error if the function is unimplemented. + +### Other uses + +Other uses of `arrow` are described in the following vignettes: + +- `vignette("python", package = "arrow")`: use `arrow` and + `reticulate` to pass data between R and Python +- `vignette("flight", package = "arrow")`: connect to Arrow Flight RPC + servers to send and receive data +- `vignette("arrow", package = "arrow")`: access and manipulate Arrow + objects through low-level bindings to the C++ library + +## Getting help + +If you encounter a bug, please file an issue with a minimal reproducible +example on the [Apache Jira issue +tracker](https://issues.apache.org/jira/projects/ARROW/issues). Create +an account or log in, then click **Create** to file an issue. Select the +project **Apache Arrow (ARROW)**, select the component **R**, and begin +the issue summary with **\[R\]** followed by a space. For more Review comment: Huh, weird, thanks. Easy dodge: I'll just make it `monospace` so no escapes are needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org