jonkeane commented on a change in pull request #9898:
URL: https://github.com/apache/arrow/pull/9898#discussion_r613626481



##########
File path: r/vignettes/developing.Rmd
##########
@@ -0,0 +1,510 @@
+---
+title: "Arrow R Developer Guide"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Arrow R Developer Guide}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r setup options, include=FALSE}
+knitr::opts_chunk$set(error = TRUE, eval = FALSE)
+
+# Get environment variables describing what to evaluate
+run <- tolower(Sys.getenv("RUN_DEVDOCS", "false")) == "true"
+macos <- tolower(Sys.getenv("DEVDOCS_MACOS", "false")) == "true"
+ubuntu <- tolower(Sys.getenv("DEVDOCS_UBUNTU", "false")) == "true"
+sys_install <- tolower(Sys.getenv("DEVDOCS_SYSTEM_INSTALL", "false")) == "true"
+
+# Update the source knit_hook to save the chunk (if it is marked to be saved)
+knit_hooks_source <- knitr::knit_hooks$get("source")
+knitr::knit_hooks$set(source = function(x, options) {
+  # Extra paranoia about when this will write the chunks to the script, we will
+  # only save when:
+  #   * CI is true
+  #   * RUN_DEVDOCS is true
+  #   * options$save is TRUE (and a check that not NULL won't crash it)
+  if (as.logical(Sys.getenv("CI", FALSE)) && run && !is.null(options$save) && 
options$save)
+    cat(x, file = "script.sh", append = TRUE, sep = "\n")
+  # but hide the blocks we want hidden:
+  if (!is.null(options$hide) && options$hide) {
+    return(NULL)
+  }
+  knit_hooks_source(x, options)
+})
+```
+
+```{bash, save=run, hide=TRUE}
+# Stop on failure, echo input as we go
+set -e
+set -x
+```
+
+If you're looking to contribute to `arrow`, this document can help you set up 
a development environment that will enable you to write code and run tests 
locally. It outlines how to build the various components that make up the Arrow 
project and R package, as well as some common troubleshooting and workflows 
developers use. Many contributions can be accomplished with the instructions in 
[R-only development](#r-only-development). But if you're working on both the 
C++ library and the R package, the [Developer environment 
setup](#developer-environment-setup) section will guide you through setting up 
a developer environment.
+
+This document is intended only for developers of Apache Arrow or the Arrow R 
package. Users of the package in R do not need to do any of this setup. If 
you're looking for how to install Arrow, see [the instructions in the 
readme](https://arrow.apache.org/docs/r/#installation); Linux users can find 
more details on building from source at `vignette("install", package = 
"arrow")`.
+
+This document is a work in progress and will grow and change as the Apache Arrow 
project grows and changes. We have tried to make these steps as robust as 
possible (in fact, we even test exactly these instructions on our nightly CI to 
ensure they don't become stale!), but certain custom configurations might 
conflict with these instructions, and developers differ on whether there is one 
true way to set up a development environment like this and on what it is. We 
also welcome any feedback you have about things that are confusing or additions 
you would like to see here. Please [report an 
issue](https://issues.apache.org/jira/projects/ARROW/issues) if you see 
anything that is confusing, odd, or just plain wrong.
+
+## R-only development
+
+Windows and macOS users who wish to contribute to the R package and
+don’t need to alter the Arrow C++ library may be able to obtain a
+recent version of the library without building from source. On macOS,
+you may install the C++ library using [Homebrew](https://brew.sh/):
+
+``` shell
+# For the released version:
+brew install apache-arrow
+# Or for a development version, you can try:
+brew install apache-arrow --HEAD
+```
+
+On Windows and Linux, you can download a .zip file with the arrow dependencies 
from the
+nightly repository,
+and then set the `RWINLIB_LOCAL` environment variable to point to that
+zip file before installing the `arrow` R package. Version numbers in that
+repository correspond to dates, and you will likely want the most recent.
+
+To see what nightlies are available, you can use Arrow's (or any other S3 
client's) S3 listing functionality to see what is in the bucket 
`s3://arrow-r-nightly/libarrow/bin`:
+
+```r
+library(arrow)
+nightly <- s3_bucket("arrow-r-nightly")
+nightly$ls("libarrow/bin")
+```
+
+## Developer environment setup
+
+If you need to alter both the Arrow C++ library and the R package code, or if 
you can’t get a binary version of the latest C++ library elsewhere, you’ll need 
to build it from source too. This section discusses how to set up a C++ build 
configured to work with the R package. For more general resources, see the 
[Arrow C++ developer
+guide](https://arrow.apache.org/docs/developers/cpp/building.html).
+
+### Install dependencies {.tabset}
+
+The Arrow C++ library will by default use system dependencies if suitable 
versions are found; if they are not present, it will build them during its own 
build process. The only dependencies that one needs to install outside of the 
build process are `cmake` (for configuring the build) and `openssl` if you are 
building with S3 support.
+
+For a faster build, you may choose to install on the system more C++ library 
dependencies (such as `lz4`, `zstd`, etc.) so that they don't need to be built 
from source in the Arrow build. This is optional.
+
+#### macOS
+```{bash, save=run & macos}
+brew install cmake openssl
+```
+
+#### Ubuntu
+```{bash, save=run & ubuntu}
+sudo apt install -y cmake libcurl4-openssl-dev libssl-dev
+```
+
+### Configure the Arrow build {.tabset}
+
+You can choose to build and then install the Arrow library into a user-defined 
directory or into a system-level directory. You only need to do one of these 
two options.
+
+Either way, you will need to create a directory into which the C++ build will 
put its contents. It is recommended to make a `build` directory inside of the 
`cpp` directory of the Arrow git repository (it is git-ignored, so you won't 
accidentally check it in).
+
+Starting from the directory that contains your git checkout of `apache/arrow`,
+
+```{bash, save=run}
+mkdir -p arrow/cpp/build
+```
+
+#### Install to another directory
+
+It is recommended that you install the arrow library to a user-level directory 
to be used in development. In this example we will install it to a directory 
called `dist` that has the same parent as our `arrow` checkout, but it could be 
named or located anywhere you would like. However, note that your installation 
of the Arrow R package will point to this directory and need it to remain 
intact for the package to continue to work. This is one reason we recommend 
*not* placing it inside of the arrow git checkout.
+
+```{bash, save=run & !sys_install}
+export ARROW_HOME=$(pwd)/dist
+mkdir $ARROW_HOME
+```
+
+_Special instructions on Linux:_ You will need to set `LD_LIBRARY_PATH` to the 
`lib` directory under `$ARROW_HOME` before launching R and using Arrow. One way 
to do this is to add it to your profile (we use `~/.bash_profile` here, but you 
might need to put this in a different file depending on your setup). On macOS 
we do not need to do this because the macOS shared library paths are hardcoded 
to their locations during build time.
+
+```{bash, save=run & ubuntu & !sys_install}
+export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH
+echo "export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH" >> ~/.bash_profile
+```
+
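+Before launching R, it can help to confirm that the directory really is on the path. The snippet below is a minimal sketch rather than part of the tested instructions: the helper name `arrow_lib_on_path` is made up for illustration, and the fallback value for `ARROW_HOME` simply mirrors the `dist` directory used above.
+
```shell
# Assumption: fall back to the "dist" directory used in this guide if unset.
ARROW_HOME="${ARROW_HOME:-$(pwd)/dist}"

# Hypothetical helper: succeeds if $ARROW_HOME/lib is an entry in LD_LIBRARY_PATH.
arrow_lib_on_path() {
  # Wrap the path in colons so the first and last entries match too.
  case ":${LD_LIBRARY_PATH:-}:" in
    *":${ARROW_HOME}/lib:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if arrow_lib_on_path; then
  echo "LD_LIBRARY_PATH includes ${ARROW_HOME}/lib"
else
  echo "LD_LIBRARY_PATH is missing ${ARROW_HOME}/lib"
fi
```
+
+If the second message appears in a new terminal, the profile file you edited is probably not the one your shell reads at startup.
+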
+Now we can move into the arrow repository to start the build process. To 
build, change directories to be inside `arrow/cpp/build` (we do this in two 
steps so that we can use `popd` later to return to the `arrow` directory):
+
+```{bash, save=run & !sys_install}
+pushd arrow
+pushd cpp/build
+```
+
+You’ll first call `cmake` to configure the build and then `make install`. For 
the R package, you’ll need to enable several features in the C++ library using 
`-D` flags:
+
+```{bash, save=run & !sys_install}
+cmake \
+  -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
+  -DCMAKE_INSTALL_LIBDIR=lib \
+  -DARROW_COMPUTE=ON \
+  -DARROW_CSV=ON \
+  -DARROW_DATASET=ON \
+  -DARROW_FILESYSTEM=ON \
+  -DARROW_JEMALLOC=ON \
+  -DARROW_JSON=ON \
+  -DARROW_PARQUET=ON \
+  -DARROW_WITH_SNAPPY=ON \
+  -DARROW_WITH_ZLIB=ON \
+  -DARROW_INSTALL_NAME_RPATH=OFF \
+  ..
+```
+
+`..` refers to the C++ source directory: we're in `cpp/build`, and the source 
is in `cpp`.
+
+#### Install to the system
+
+If you would like to install Arrow as a system library, you can do that as 
well. This is in some respects simpler, but it will disrupt any Arrow libraries 
already installed there and may require `sudo` permissions.
+
+Now we can move into the arrow repository to start the build process. To 
build, change directories to be inside `arrow/cpp/build` (we do this in two 
steps so that we can use `popd` later to return to the `arrow` directory):
+
+```{bash, save=run & sys_install}
+pushd arrow
+pushd cpp/build
+```
+
+You’ll first call `cmake` to configure the build and then `make install`. For 
the R package, you’ll need to enable several features in the C++ library using 
`-D` flags:
+
+```{bash, save=run & sys_install}
+cmake \
+  -DARROW_COMPUTE=ON \
+  -DARROW_CSV=ON \
+  -DARROW_DATASET=ON \
+  -DARROW_FILESYSTEM=ON \
+  -DARROW_JEMALLOC=ON \
+  -DARROW_JSON=ON \
+  -DARROW_PARQUET=ON \
+  -DARROW_WITH_SNAPPY=ON \
+  -DARROW_WITH_ZLIB=ON \
+  -DARROW_INSTALL_NAME_RPATH=OFF \
+  ..
+```
+
+`..` refers to the C++ source directory: we're in `cpp/build`, and the source 
is in `cpp`.
+
+### More Arrow features
+
+To enable optional features including: S3 support, an alternative memory 
allocator, and additional compression libraries, add some or all of these flags 
(the trailing `\` makes them easier to paste into a bash shell on a new line):
+
+``` shell
+  -DARROW_MIMALLOC=ON \
+  -DARROW_WITH_BROTLI=ON \
+  -DARROW_WITH_BZ2=ON \
+  -DARROW_WITH_LZ4=ON \
+  -DARROW_WITH_SNAPPY=ON \
+  -DARROW_WITH_ZSTD=ON \
+```
+
+Other flags that may be useful:
+
+* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library 
point to files and line numbers
+* `-DBoost_SOURCE=BUNDLED` and `-DThrift_SOURCE=bundled`, for example, or any 
other dependency `*_SOURCE`, if you have a system version of a C++ dependency 
that doesn't work correctly with Arrow. This tells the build to compile its own 
version of the dependency from source.
+* `-DCMAKE_BUILD_TYPE=debug` and `-DCMAKE_BUILD_TYPE=relwithdebinfo` can be 
useful for debugging, though they are both slower to compile than the default 
`release`.
+
+_Note_: `cmake` is particularly sensitive to whitespace. If you see errors, 
check that your command doesn't contain any errant whitespace.
+
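+One whitespace problem that is easy to miss when pasting flags is a space or tab after a trailing `\`: the shell then no longer treats the backslash as a line continuation, so the command ends early. The following is a small sketch for spotting such lines in a saved script; the helper name `find_broken_continuations` and the throwaway demo file are illustrative assumptions, not part of the Arrow tooling.
+
```shell
# Hypothetical helper: list lines in a script where a trailing "\" is followed
# by spaces or tabs, which stops the shell from continuing onto the next line.
find_broken_continuations() {
  grep -nE '\\[[:space:]]+$' "$1"
}

# Demonstration on a throwaway file; its second line has a space after "\".
demo=$(mktemp)
printf '%s\n' 'cmake \' '  -DARROW_WITH_ZSTD=ON \ ' '  ..' > "$demo"
find_broken_continuations "$demo"   # reports line 2 only
rm -f "$demo"
```
+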
+### Build Arrow
+
+To speed up compilation by running in parallel, you can add `-j#` between 
`make` and `install` (where `#` is the number of cores you have 
available).
+
+```{bash, save=run & !(sys_install & ubuntu)}
+make install
+```
+
+If you are installing on Linux and installing to the system, you may
+need to use `sudo`:
+
+```{bash, save=run & sys_install & ubuntu}
+sudo make install
+```
+
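+If you are unsure how many cores to pass to `-j`, the count can be looked up from the shell. This is a sketch under a few assumptions: `detect_cores` is a made-up helper name, `nproc` is assumed to be available on Linux (it ships with GNU coreutils), and `sysctl -n hw.ncpu` is used on macOS. The command is printed rather than run so that you can check it first.
+
```shell
# Hypothetical helper: print the number of CPU cores, falling back to 1.
detect_cores() {
  if command -v nproc >/dev/null 2>&1; then
    nproc                                   # GNU coreutils (most Linux systems)
  elif command -v sysctl >/dev/null 2>&1; then
    sysctl -n hw.ncpu 2>/dev/null || echo 1 # macOS
  else
    echo 1
  fi
}

# Print the command instead of running it, so it can be inspected first.
echo "make -j$(detect_cores) install"
```
+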
+Note that after any change to the C++ library, you must reinstall it and
+run `make clean` or `git clean -fdx .` to remove any cached object code
+in the `r/src/` directory before reinstalling the R package. This is
+only necessary if you make changes to the C++ library source; you do not
+need to manually purge object files if you are only editing R or C++
+code inside `r/`.
+
+
+### Build the Arrow R package
+
+Once you’ve built the C++ library, you can install the R package and its
+dependencies, along with additional dev dependencies, from the git
+checkout:
+
+```{bash, save=run}
+popd # To go back to the root directory of the project, from cpp/build
+
+pushd r
+R -e 'install.packages("remotes"); remotes::install_deps(dependencies = TRUE)'
+
+R CMD INSTALL .
+```
+
+### Compilation flags
+
+If you need to set any compilation flags while building the C++
+extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
+example, if you are using `perf` to profile the R extensions, you may
+need to set
+
+``` shell
+export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+```
+
+### Developer Experience
+
+With the setups described here, you should not need to rebuild the Arrow 
library or even the C++ source in the R package as you iterate and work on the 
R package. The only time those should need to be rebuilt is if you have changed 
the C++ in the R package (and even then, `R CMD INSTALL .` should only need to 
recompile the files that have changed) _or_ if the Arrow library C++ has 
changed and there is a mismatch between the Arrow library and the R package. If 
you find yourself rebuilding either or both each time you install the package 
or run tests, something is probably wrong with your setup.
+
+<details>
+<summary>For a full build: a `cmake` command with all of the R-relevant 
optional dependencies turned on</summary>
+<p>
+
+``` shell
+cmake \
+  -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
+  -DCMAKE_INSTALL_LIBDIR=lib \
+  -DARROW_COMPUTE=ON \
+  -DARROW_CSV=ON \
+  -DARROW_DATASET=ON \
+  -DARROW_FILESYSTEM=ON \
+  -DARROW_JEMALLOC=ON \
+  -DARROW_JSON=ON \
+  -DARROW_PARQUET=ON \
+  -DARROW_WITH_SNAPPY=ON \
+  -DARROW_WITH_ZLIB=ON \
+  -DARROW_INSTALL_NAME_RPATH=OFF \
+  -DARROW_MIMALLOC=ON \
+  -DARROW_WITH_BROTLI=ON \
+  -DARROW_WITH_BZ2=ON \
+  -DARROW_WITH_LZ4=ON \
+  -DARROW_WITH_SNAPPY=ON \
+  -DARROW_WITH_ZSTD=ON \
+  ..
+```
+</p>
+</details>  
+
+## Troubleshooting
+
+### Arrow library-R package mismatches
+
+If the Arrow library and the R package have diverged, you will see errors like:
+
+```
+Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath 
= DLLpath, ...):
+ unable to load shared object 
'/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so':
+  
dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so,
 6): Symbol not found: 
__ZN5arrow2io16RandomAccessFile9ReadAsyncERKNS0_9IOContextExx
+  Referenced from: 
/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so
+  Expected in: flat namespace
+ in 
/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so
+Error: loading failed
+Execution halted
+ERROR: loading failed
+```
+
+To resolve this, try rebuilding the Arrow library by following [Build Arrow 
above](#build-arrow).
+
+### Multiple versions of Arrow library
+
+If rebuilding the Arrow library doesn't work, and you are [installing from a 
user-level directory](#install-to-another-directory) while a previous 
installation of libarrow already exists in a system directory, you may get 
errors like the following when you install the R package:
+
+```
+Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath 
= DLLpath, ...):
+ unable to load shared object 
'/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so':
+  
dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so,
 6): Library not loaded: /usr/local/lib/libarrow.400.dylib
+  Referenced from: /usr/local/lib/libparquet.400.dylib
+  Reason: image not found
+```
+
+You need to make sure that you don't let R link to your system library when 
building arrow. You can do this a number of different ways:
+
+* Setting the `MAKEFLAGS` environment variable to `"LDFLAGS="` (see below for 
an example). This is the recommended way to accomplish this.
+* Using {withr}'s `with_makevars(list(LDFLAGS = ""), ...)`
+* Adding `LDFLAGS=` to your `~/.R/Makevars` file (the least recommended way, 
though it is a common debugging approach suggested online)
+
+```{bash, save=run & !sys_install & macos, hide=TRUE}
+# Setup troubleshooting section
+# install a system-level arrow on macOS
+brew install apache-arrow
+```
+
+
+```{bash, save=run & !sys_install & ubuntu, hide=TRUE}
+# Setup troubleshooting section
+# install a system-level arrow on Ubuntu
+sudo apt update
+sudo apt install -y -V ca-certificates lsb-release wget
+wget https://apache.bintray.com/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-archive-keyring-latest-$(lsb_release --codename --short).deb
+sudo apt install -y -V ./apache-arrow-archive-keyring-latest-$(lsb_release --codename --short).deb
+sudo apt update
+sudo apt install -y -V libarrow-dev
+```
+
+```{bash, save=run & !sys_install & macos}
+MAKEFLAGS="LDFLAGS=" R CMD INSTALL .
+```
+
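+To check which Arrow libraries a compiled shared object actually points at, you can list its dependencies with `otool -L` on macOS or `ldd` on Linux. The helpers below are a sketch with made-up names (`lib_inspect_tool`, `linked_arrow_libs`), and the path in the usage comment is an assumption: `R CMD INSTALL .` normally leaves the compiled object in `src/` inside the `r` directory, but check that it exists on your machine.
+
```shell
# Hypothetical helper: name the tool that lists a shared object's dependencies.
lib_inspect_tool() {
  case "$(uname -s)" in
    Darwin) echo "otool -L" ;;  # macOS
    *)      echo "ldd" ;;       # Linux and most other Unix-likes
  esac
}

# Hypothetical helper: show only the Arrow and Parquet libraries that the
# given shared object depends on.
linked_arrow_libs() {
  # Unquoted on purpose so "otool -L" splits into a command and its flag.
  $(lib_inspect_tool) "$1" | grep -E 'arrow|parquet'
}

# Example, run from the r directory (assumed path; adjust to your setup):
# linked_arrow_libs src/arrow.so
```
+
+Entries under `/usr/local/lib` when you meant to use `$ARROW_HOME/lib` suggest that the system copy was picked up at link time.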
+
+### `rpath` issues
+
+If the package fails to install/load with an error like this:
+
+```
+  ** testing if installed package can be loaded from temporary location
+  Error: package or namespace load failed for 'arrow' in dyn.load(file, 
DLLpath = DLLpath, ...):
+  unable to load shared object 
'/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
+  dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not 
loaded: @rpath/libarrow.14.dylib
+```
+
+ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
+macOS to prevent problems at link time and is a no-op on other platforms).
+Alternatively, try setting the environment variable `R_LD_LIBRARY_PATH` to
+wherever Arrow C++ was put in `make install`, e.g. `export
+R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+
+When installing from source, if the R and C++ library versions do not
+match, installation may fail. If you’ve previously installed the
+libraries and want to upgrade the R package, you’ll need to update the
+Arrow C++ library first.
+
+For any other build/configuration challenges, see the [C++ developer
+guide](https://arrow.apache.org/docs/developers/cpp/building.html).
+
+
+## Using `remotes::install_github(...)`
+
+If you need an Arrow installation from a specific repository or at a specific 
ref,
+`remotes::install_github("apache/arrow/r", build = FALSE)`
+should work on most platforms (with the notable exception of Windows).
+The `build = FALSE` argument is important so that the installation can access 
the
+C++ source in the `cpp/` directory in `apache/arrow`.
+
+As with other installation methods, setting the environment variables 
`LIBARROW_MINIMAL=false` and `ARROW_R_DEV=true` will provide a more 
full-featured version of Arrow and provide more verbose output, respectively.
+
+For example, to install from the (fictional) branch `bugfix` from 
`apache/arrow` one could:
+
+```r
+Sys.setenv(LIBARROW_MINIMAL="false")
+remotes::install_github("apache/arrow/r@bugfix", build = FALSE)
+```
+
+Developers may wish to use this method of installing a specific commit
+separate from another Arrow development environment or system installation
+(e.g. we use this in [arrowbench](https://github.com/ursacomputing/arrowbench) 
to install development versions of arrow isolated from the system install). If 
you already have Arrow C++ libraries installed system-wide, you may need to set 
some additional variables in order to isolate this build from your system 
libraries:
+
+* Setting the environment variable `FORCE_BUNDLED_BUILD` to `true` will skip 
the `pkg-config` search for Arrow libraries and attempt to build from the same 
source at the repository+ref given.
+* You may also need to set the Makevars `CPPFLAGS` and `LDFLAGS` to `""` in 
order to prevent the installation process from attempting to link to already 
installed system versions of Arrow. One way to do this temporarily is wrapping 
your `remotes::install_github()` call like so: 
`withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), 
remotes::install_github(...))`.
+
+## How does our automated building actually work?
+
+There are a number of scripts that are triggered when `R CMD INSTALL .` is run. 
For Arrow users, these should all just work without configuration and pull in 
the most complete pieces (e.g. official binaries that we host) so the 
installation process is easy. However, knowing about these scripts can help you 
troubleshoot if something goes wrong in them or elsewhere in an install:
+
+* `configure` and `configure.win`: These scripts are triggered during `R CMD 
INSTALL .` on non-Windows and Windows platforms, respectively. They handle 
finding the Arrow library, setting up the necessary build variables, and 
writing the package Makevars file that is used to compile the C++ code in the R 
package.
+* `tools/nixlibs.R`: This script is called by `configure` on Linux (or on any 
non-Windows OS with the environment variable `FORCE_BUNDLED_BUILD=true`). It 
sets up the build process for our bundled builds (which is the default on 
Linux). The operative logic is at the end of the script, but it will do the 
following (it stops at the first step that succeeds, and some of the steps are 
only checked if they are enabled via an environment variable):
+  * Check if there is an already built libarrow in 
`arrow/r/libarrow-{version}`, and use that to link against if it exists.
+  * Check if a binary is available from our hosted official builds.
+  * Download the Arrow source and build the Arrow Library from source.
+  * `*** Proceed without C++` dependencies (this is an error and the package 
will not work, but if you see this message you know the previous steps have not 
succeeded/were not enabled)
+* `inst/build_arrow_static.sh`: This script builds Arrow for a bundled, static 
build. It is called by `tools/nixlibs.R` when the Arrow library is being built.
+
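+The "stop at the first step that succeeds" order described for `tools/nixlibs.R` can be pictured as a chain of fallbacks. The sketch below is only an illustration of that control flow in shell: it is not the code in `tools/nixlibs.R` (which is an R script), and every function name in it is a stand-in invented for this example.
+
```shell
# Stand-ins for the three steps; each returns 0 on success. Here all of them
# "fail" so that the final message is reached.
use_local_libarrow() { return 1; }  # already built libarrow in r/libarrow-{version}
download_binary()    { return 1; }  # hosted official binary
build_from_source()  { return 1; }  # download the source and build it

# Try each step in order; "||" moves on only when the previous step failed.
find_libarrow() {
  use_local_libarrow || download_binary || build_from_source || {
    echo "*** Proceed without C++ dependencies"
    return 1
  }
}

find_libarrow || echo "No usable libarrow was found"
```
+
+Seeing the final message therefore tells you that every earlier step either failed or was not enabled.
+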
+## Editing C++ code
+
+The `arrow` package uses some customized tools on top of `cpp11` to

Review comment:
       TIL. Are there any other reasons I should add here too?



