nealrichardson commented on code in PR #49068:
URL: https://github.com/apache/arrow/pull/49068#discussion_r2759115062
##########
r/vignettes/developers/binary_features.Rmd:
##########
@@ -0,0 +1,203 @@
+---
+title: "Libarrow binary features"
+description: >
+  Understanding which C++ features are enabled in Arrow R package builds
+output: rmarkdown::html_vignette
+---
+
+This document explains which C++ features are enabled in different Arrow R
+package build configurations, and documents the decisions behind our default
+feature set.
+
+## Overview
+
+When the Arrow R package is installed, it needs a copy of the Arrow C++ library
+(libarrow). This can come from:
+
+1. **Prebuilt binaries** we host (for releases and nightlies)
+2. **Source builds** when binaries aren't available or users opt out
+
+The features available in libarrow depend on how it was built. This document
+covers the feature configuration for both scenarios.
+
+## Feature configuration in source builds
+
+Source builds are controlled by `r/inst/build_arrow_static.sh`. The key
+environment variable is `LIBARROW_MINIMAL`:
+
+- `LIBARROW_MINIMAL=true` (or unset): Minimal feature set
+- `LIBARROW_MINIMAL=false`: Full feature set
+
+### Features always enabled
+
+These features are always built regardless of `LIBARROW_MINIMAL`:
+
+| Feature | CMake Flag | Notes |
+|---------|------------|-------|
+| Compute | `ARROW_COMPUTE=ON` | Core compute functions |
+| CSV | `ARROW_CSV=ON` | CSV reading/writing |
+| Filesystem | `ARROW_FILESYSTEM=ON` | Local filesystem support |
+| JSON | `ARROW_JSON=ON` | JSON reading |
+| Parquet | `ARROW_PARQUET=ON` | Parquet file format |
+| Dataset | `ARROW_DATASET=ON` | Multi-file datasets |
+| Acero | `ARROW_ACERO=ON` | Query execution engine |
+| Mimalloc | `ARROW_MIMALLOC=ON` | Memory allocator |
+| LZ4 | `ARROW_WITH_LZ4=ON` | LZ4 compression |
+| Snappy | `ARROW_WITH_SNAPPY=ON` | Snappy compression |
+| RE2 | `ARROW_WITH_RE2=ON` | Regular expressions |
+| UTF8Proc | `ARROW_WITH_UTF8PROC=ON` | Unicode support |
+
+### Features controlled by LIBARROW_MINIMAL
+
+When `LIBARROW_MINIMAL=false`, the following additional features are enabled
+(via `$ARROW_DEFAULT_PARAM=ON`):
+
+| Feature | CMake Flag | Default |
+|---------|------------|---------|
+| S3 | `ARROW_S3` | `$ARROW_DEFAULT_PARAM` |
+| Jemalloc | `ARROW_JEMALLOC` | `$ARROW_DEFAULT_PARAM` |
+| Brotli | `ARROW_WITH_BROTLI` | `$ARROW_DEFAULT_PARAM` |
+| BZ2 | `ARROW_WITH_BZ2` | `$ARROW_DEFAULT_PARAM` |
+| Zlib | `ARROW_WITH_ZLIB` | `$ARROW_DEFAULT_PARAM` |
+| Zstd | `ARROW_WITH_ZSTD` | `$ARROW_DEFAULT_PARAM` |
+
+### Features that require explicit opt-in
+
+GCS (Google Cloud Storage) is **always off by default**, even when
+`LIBARROW_MINIMAL=false`:
+
+| Feature | CMake Flag | Default | Reason |
+|---------|------------|---------|--------|
+| GCS | `ARROW_GCS` | `OFF` | Build complexity, dependency size |
+
+To enable GCS in a source build, you must explicitly set `ARROW_GCS=ON`.
+
+**Why is GCS off by default?**
+
+GCS was turned off by default in [#48343](https://github.com/apache/arrow/pull/48343)
+(December 2025) because:
+
+1. Building google-cloud-cpp is fragile and adds significant build time
+2. The dependency on abseil (ABSL) has caused compatibility issues
+3. Usage telemetry suggested low adoption compared to S3

Review Comment:
   telemetry?
