jonkeane commented on code in PR #49068: URL: https://github.com/apache/arrow/pull/49068#discussion_r2759200110
##########
r/vignettes/developers/binary_features.Rmd:
##########
@@ -0,0 +1,203 @@
+---
+title: "Libarrow binary features"
+description: >
+  Understanding which C++ features are enabled in Arrow R package builds
+output: rmarkdown::html_vignette
+---
+
+This document explains which C++ features are enabled in different Arrow R
+package build configurations, and documents the decisions behind our default
+feature set.
+
+## Overview
+
+When the Arrow R package is installed, it needs a copy of the Arrow C++ library
+(libarrow). This can come from:
+
+1. **Prebuilt binaries** we host (for releases and nightlies)
+2. **Source builds** when binaries aren't available or users opt out
+
+The features available in libarrow depend on how it was built. This document
+covers the feature configuration for both scenarios.
+
+## Feature configuration in source builds
+
+Source builds are controlled by `r/inst/build_arrow_static.sh`. The key
+environment variable is `LIBARROW_MINIMAL`:
+
+- `LIBARROW_MINIMAL=true` (or unset): Minimal feature set
+- `LIBARROW_MINIMAL=false`: Full feature set
+
+### Features always enabled
+
+These features are always built regardless of `LIBARROW_MINIMAL`:
+
+| Feature | CMake Flag | Notes |
+|---------|------------|-------|
+| Compute | `ARROW_COMPUTE=ON` | Core compute functions |
+| CSV | `ARROW_CSV=ON` | CSV reading/writing |
+| Filesystem | `ARROW_FILESYSTEM=ON` | Local filesystem support |
+| JSON | `ARROW_JSON=ON` | JSON reading |
+| Parquet | `ARROW_PARQUET=ON` | Parquet file format |
+| Dataset | `ARROW_DATASET=ON` | Multi-file datasets |
+| Acero | `ARROW_ACERO=ON` | Query execution engine |
+| Mimalloc | `ARROW_MIMALLOC=ON` | Memory allocator |
+| LZ4 | `ARROW_WITH_LZ4=ON` | LZ4 compression |
+| Snappy | `ARROW_WITH_SNAPPY=ON` | Snappy compression |
+| RE2 | `ARROW_WITH_RE2=ON` | Regular expressions |
+| UTF8Proc | `ARROW_WITH_UTF8PROC=ON` | Unicode support |
+
+### Features controlled by LIBARROW_MINIMAL
+
+When `LIBARROW_MINIMAL=false`, the following additional features are enabled
+(via `$ARROW_DEFAULT_PARAM=ON`):
+
+| Feature | CMake Flag | Default |
+|---------|------------|---------|
+| S3 | `ARROW_S3` | `$ARROW_DEFAULT_PARAM` |
+| Jemalloc | `ARROW_JEMALLOC` | `$ARROW_DEFAULT_PARAM` |
+| Brotli | `ARROW_WITH_BROTLI` | `$ARROW_DEFAULT_PARAM` |
+| BZ2 | `ARROW_WITH_BZ2` | `$ARROW_DEFAULT_PARAM` |
+| Zlib | `ARROW_WITH_ZLIB` | `$ARROW_DEFAULT_PARAM` |
+| Zstd | `ARROW_WITH_ZSTD` | `$ARROW_DEFAULT_PARAM` |
+
+### Features that require explicit opt-in
+
+GCS (Google Cloud Storage) is **always off by default**, even when
+`LIBARROW_MINIMAL=false`:
+
+| Feature | CMake Flag | Default | Reason |
+|---------|------------|---------|--------|
+| GCS | `ARROW_GCS` | `OFF` | Build complexity, dependency size |
+
+To enable GCS in a source build, you must explicitly set `ARROW_GCS=ON`.
+
+**Why is GCS off by default?**
+
+GCS was turned off by default in [#48343](https://github.com/apache/arrow/pull/48343)
+(December 2025) because:
+
+1. Building google-cloud-cpp is fragile and adds significant build time
+2. The dependency on abseil (ABSL) has caused compatibility issues
+3. Usage telemetry suggested low adoption compared to S3
+4. Users who need GCS can still enable it explicitly
+
+## Prebuilt binary configuration
+
+We produce prebuilt libarrow binaries for macOS, Windows, and Linux. These
+binaries include **more features** than the default source build to provide
+users with a fully-featured experience out of the box.
+
+### Current binary feature set
+
+| Platform | S3 | GCS | Configured in |
+|----------|----|----|---------------|
+| macOS (ARM64, x86_64) | ON | ON | `dev/tasks/r/github.packages.yml` |
+| Windows | ON | ON | `ci/scripts/PKGBUILD` |
+| Linux (x86_64) | ON | ON | `compose.yaml` (`ubuntu-cpp-static`) |
+
+### Why binaries have GCS enabled
+
+Even though GCS defaults to OFF for source builds, we explicitly enable it in
+our prebuilt binaries because:
+
+1. **Binary users expect features to "just work"** - they shouldn't need to
+   rebuild from source to access cloud storage
+2. **Build time is not a concern** - we build binaries once in CI, not on
+   user machines
+3. **Parity across platforms** - users get the same features regardless of OS
+
+## Configuration file locations
+
+### Source build configuration
+
+The main build script that controls source builds:
+
+- **`r/inst/build_arrow_static.sh`** - CMake flags and defaults
+  ([view source](https://github.com/apache/arrow/blob/main/r/inst/build_arrow_static.sh))
+
+Key lines:
+```bash
+# Line 52-56: LIBARROW_MINIMAL controls ARROW_DEFAULT_PARAM
+if [ "$LIBARROW_MINIMAL" = "false" ]; then
+  ARROW_DEFAULT_PARAM="ON"
+else
+  ARROW_DEFAULT_PARAM="OFF"
+fi
+
+# Line 81: GCS is always OFF by default (not tied to LIBARROW_MINIMAL)
+-DARROW_GCS=${ARROW_GCS:-OFF}
+
+# Line 86: S3 follows LIBARROW_MINIMAL
+-DARROW_S3=${ARROW_S3:-$ARROW_DEFAULT_PARAM}
+```
+
+### Binary build configuration

Review Comment:
   On top of the additional note at the top, I'm also putting `libarrow` in a bunch of these headings to make it clear(er) that this is about libarrow (and maybe `lib...` will also scare people away as an internal reference, which it is!)
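
For anyone following this thread without the script open, the default resolution quoted under "Key lines" in the hunk can be tried as a small standalone sketch. The function name `resolve_flags` is hypothetical and only for illustration. The real logic lives inline in `r/inst/build_arrow_static.sh` and passes many more flags to CMake. Only the three quoted lines are mirrored here.

```bash
#!/bin/sh
# Hypothetical helper (not part of the Arrow repo): prints the GCS and S3
# CMake flags that the quoted defaults would produce for the current environment.
resolve_flags() {
  # LIBARROW_MINIMAL=false turns the shared default on; any other value,
  # or leaving it unset, turns it off.
  if [ "${LIBARROW_MINIMAL:-}" = "false" ]; then
    ARROW_DEFAULT_PARAM="ON"
  else
    ARROW_DEFAULT_PARAM="OFF"
  fi

  # GCS ignores ARROW_DEFAULT_PARAM: it stays OFF unless ARROW_GCS is set.
  echo "-DARROW_GCS=${ARROW_GCS:-OFF}"

  # S3 follows ARROW_DEFAULT_PARAM unless ARROW_S3 is set explicitly.
  echo "-DARROW_S3=${ARROW_S3:-$ARROW_DEFAULT_PARAM}"
}
```

Running it in a subshell, for example `(LIBARROW_MINIMAL=false; resolve_flags)`, keeps the variables from leaking into the calling shell. Under these assumptions, a full-featured source build still reports GCS as off unless `ARROW_GCS=ON` is also set, which is the asymmetry the vignette text describes.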
