nealrichardson commented on a change in pull request #11001:
URL: https://github.com/apache/arrow/pull/11001#discussion_r698674825
##########
File path: r/R/install-arrow.R
##########
@@ -137,3 +136,68 @@ reload_arrow <- function() {
message("Please restart R to use the 'arrow' package.")
}
}
+
+
+#' Download all optional Arrow dependencies
+#'
+#' @param deps_dir Directory to save files into. Will be created if necessary.
+#' Defaults to the value of `ARROW_THIRDPARTY_DEPENDENCY_DIR`, if that
+#' environment variable is set.
+#'
+#' @return `deps_dir`, invisibly
+#'
+#' This function is used for setting up an offline build. If it's possible to
+#' download at build time, don't use this function. Instead, let `cmake`
+#' download them for you.
+#' If the files already exist in `deps_dir`, they will be re-downloaded and
+#' overwritten. Do not put other files in this directory.
+#' These saved files are only used in the build if `ARROW_DEPENDENCY_SOURCE`
+#' is `BUNDLED` or `AUTO`.
+#' https://arrow.apache.org/docs/developers/cpp/building.html#offline-builds
+#'
+#' ## Steps for an offline install with optional dependencies:
+#'
+#' ### On a computer with internet access:
+#' - Install the `arrow` package
Review comment:
```suggestion
#' - If you don't already have the `arrow` package installed, get this function by
#'   `source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")`
```
##########
File path: r/tools/nixlibs.R
##########
@@ -373,7 +374,15 @@ ensure_cmake <- function() {
)
cmake_tar <- tempfile()
cmake_dir <- tempfile()
- try_download(cmake_binary_url, cmake_tar)
+ download_successful <- try_download(cmake_binary_url, cmake_tar)
+ if (!download_successful) {
+ cat(paste0(
+ "*** cmake was not found locally and download failed.\n",
+ " Make sure cmake is installed and available on your PATH\n",
+ " (or download '", cmake_binary_url,
+ "' and define the CMAKE environment variable).\n"
+ ))
Review comment:
```suggestion
cat(paste0(
"*** cmake was not found locally and download failed.\n",
" Make sure cmake >= 3.10 is installed and available on your PATH,\n",
" or download ", cmake_binary_url, "\n",
" and define the CMAKE environment variable.\n"
))
```
##########
File path: r/R/install-arrow.R
##########
@@ -137,3 +136,68 @@ reload_arrow <- function() {
message("Please restart R to use the 'arrow' package.")
}
}
+
+
+#' Download all optional Arrow dependencies
+#'
+#' @param deps_dir Directory to save files into. Will be created if necessary.
+#' Defaults to the value of `ARROW_THIRDPARTY_DEPENDENCY_DIR`, if that
+#' environment variable is set.
+#'
+#' @return `deps_dir`, invisibly
+#'
+#' This function is used for setting up an offline build. If it's possible to
+#' download at build time, don't use this function. Instead, let `cmake`
+#' download them for you.
+#' If the files already exist in `deps_dir`, they will be re-downloaded and
+#' overwritten. Do not put other files in this directory.
+#' These saved files are only used in the build if `ARROW_DEPENDENCY_SOURCE`
+#' is `BUNDLED` or `AUTO`.
+#' https://arrow.apache.org/docs/developers/cpp/building.html#offline-builds
+#'
+#' ## Steps for an offline install with optional dependencies:
+#'
+#' ### On a computer with internet access:
+#' - Install the `arrow` package
+#' - Run this function
+#' - Copy the saved dependency files to the computer with internet access
+#'
+#' ### On the computer without internet access:
+#' - Create a environment variable called `ARROW_THIRDPARTY_DEPENDENCY_DIR` that
+#' points to the newly copied folder of dependency files.
+#' - Install the `arrow` package
+#' - Run [arrow_info()] to check installed capabilities
+#'
+#' @examples
+#' \dontrun{
+#' download_optional_dependencies("arrow-thirdparty")
+#' list.files("arrow-thirdparty", "thrift-*") # "thrift-0.13.0.tar.gz" or similar
+#' }
+#' @export
+download_optional_dependencies <- function(deps_dir = NULL) {
+ # This script is copied over from arrow/cpp/... to arrow/r/inst/...
+ download_dependencies_sh <- system.file(
+ "thirdparty/download_dependencies.sh",
+ package = "arrow",
+ mustWork = TRUE
+ )
+ if (is.null(deps_dir) && Sys.getenv("ARROW_THIRDPARTY_DEPENDENCY_DIR") != "") {
+ deps_dir <- Sys.getenv("ARROW_THIRDPARTY_DEPENDENCY_DIR")
+ }
+
+ dir.create(deps_dir, showWarnings = FALSE, recursive = TRUE)
+ # Run download_dependencies.sh
+ cat(paste0("*** Downloading optional dependencies to ", deps_dir, "\n"))
+ return_status <- system2(download_dependencies_sh,
+ args = deps_dir, stdout = FALSE, stderr = FALSE
+ )
+ if (isTRUE(return_status == 0)) {
+ cat(paste0(
+ "**** Set environment variable on offline machine and re-build arrow:\n",
Review comment:
Should this message also tell you to copy the directory to the other
machine?
##########
File path: r/R/install-arrow.R
##########
@@ -137,3 +136,68 @@ reload_arrow <- function() {
message("Please restart R to use the 'arrow' package.")
}
}
+
+
+#' Download all optional Arrow dependencies
+#'
+#' @param deps_dir Directory to save files into. Will be created if necessary.
+#' Defaults to the value of `ARROW_THIRDPARTY_DEPENDENCY_DIR`, if that
+#' environment variable is set.
+#'
+#' @return `deps_dir`, invisibly
+#'
+#' This function is used for setting up an offline build. If it's possible to
+#' download at build time, don't use this function. Instead, let `cmake`
+#' download them for you.
+#' If the files already exist in `deps_dir`, they will be re-downloaded and
+#' overwritten. Do not put other files in this directory.
+#' These saved files are only used in the build if `ARROW_DEPENDENCY_SOURCE`
+#' is `BUNDLED` or `AUTO`.
+#' https://arrow.apache.org/docs/developers/cpp/building.html#offline-builds
+#'
+#' ## Steps for an offline install with optional dependencies:
+#'
+#' ### On a computer with internet access:
+#' - Install the `arrow` package
+#' - Run this function
+#' - Copy the saved dependency files to the computer with internet access
+#'
+#' ### On the computer without internet access:
+#' - Create a environment variable called `ARROW_THIRDPARTY_DEPENDENCY_DIR` that
+#' points to the newly copied folder of dependency files.
+#' - Install the `arrow` package
+#' - Run [arrow_info()] to check installed capabilities
+#'
+#' @examples
+#' \dontrun{
+#' download_optional_dependencies("arrow-thirdparty")
+#' list.files("arrow-thirdparty", "thrift-*") # "thrift-0.13.0.tar.gz" or similar
+#' }
+#' @export
+download_optional_dependencies <- function(deps_dir = NULL) {
+ # This script is copied over from arrow/cpp/... to arrow/r/inst/...
+ download_dependencies_sh <- system.file(
+ "thirdparty/download_dependencies.sh",
+ package = "arrow",
+ mustWork = TRUE
+ )
+ if (is.null(deps_dir) && Sys.getenv("ARROW_THIRDPARTY_DEPENDENCY_DIR") != "") {
+ deps_dir <- Sys.getenv("ARROW_THIRDPARTY_DEPENDENCY_DIR")
+ }
Review comment:
```suggestion
download_optional_dependencies <- function(deps_dir = Sys.getenv("ARROW_THIRDPARTY_DEPENDENCY_DIR")) {
# This script is copied over from arrow/cpp/... to arrow/r/inst/...
download_dependencies_sh <- system.file(
"thirdparty/download_dependencies.sh",
package = "arrow",
mustWork = TRUE
)
```
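One consequence of moving the lookup into the default argument is worth spelling out: `Sys.getenv()` returns `""` for an unset variable, so `deps_dir` would never be `NULL`, and the function body would presumably need to guard against an empty string before calling `dir.create()`. A minimal sketch of such a guard follows. The helper name `resolve_deps_dir` is hypothetical and not part of the PR.

```r
# Hypothetical helper (not in the PR): resolve the dependency directory the way
# the suggested default argument would, and fail early if nothing was supplied.
resolve_deps_dir <- function(deps_dir = Sys.getenv("ARROW_THIRDPARTY_DEPENDENCY_DIR")) {
  # Sys.getenv() returns "" (not NULL) when the variable is unset
  if (!nzchar(deps_dir)) {
    stop("Set ARROW_THIRDPARTY_DEPENDENCY_DIR or pass deps_dir explicitly")
  }
  deps_dir
}

Sys.setenv(ARROW_THIRDPARTY_DEPENDENCY_DIR = "arrow-thirdparty")
resolve_deps_dir()            # value taken from the environment
resolve_deps_dir("other-dir") # an explicit argument takes precedence
```

Because default arguments are evaluated lazily, the environment variable is read at call time, not when the function is defined.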
##########
File path: r/vignettes/install.Rmd
##########
@@ -304,10 +316,12 @@ By default, these are all unset. All boolean variables are case-insensitive.
won't look for Arrow libraries on your system and instead will look to
download/build them.
Use this if you have a version mismatch between installed system libraries
and the version of the R package you're installing.
-* `LIBARROW_DOWNLOAD`: Unless set to `false`, the build script
- will attempt to download C++ binary or source bundles.
+* `TEST_OFFLINE_BUILD`: Unless set to `true`, the build script
+ will download prebuilt C++ binary or third-party source bundles as necessary.
If you're in a checkout of the `apache/arrow` git repository
Review comment:
```suggestion
```
##########
File path: r/vignettes/install.Rmd
##########
@@ -102,6 +102,14 @@ satisfy C++ dependencies.
> Note that, unlike packages like `tensorflow`, `blogdown`, and others that
> require external dependencies, you do not need to run `install_arrow()`
> after a successful `arrow` installation.
+The `install-arrow.R` file also includes the `download_optional_dependencies()`
+function. Normally, when installing on a computer with internet access, the
+build process will download third-party dependencies as needed. This function
+provides a way to download them in advance. Relevant environment variables are
+`ARROW_THIRDPARTY_DEPENDENCY_DIR` for the directory of downloaded dependencies
+and `TEST_OFFLINE_BUILD` to force the build process not to download.
Review comment:
I don't think we should document this in this vignette. Users shouldn't
have to worry about this env var; it's only for us, for testing.
##########
File path: r/vignettes/install.Rmd
##########
@@ -304,10 +316,12 @@ By default, these are all unset. All boolean variables are case-insensitive.
won't look for Arrow libraries on your system and instead will look to
download/build them.
Use this if you have a version mismatch between installed system libraries
and the version of the R package you're installing.
-* `LIBARROW_DOWNLOAD`: Unless set to `false`, the build script
- will attempt to download C++ binary or source bundles.
+* `TEST_OFFLINE_BUILD`: Unless set to `true`, the build script
+ will download prebuilt C++ binary or third-party source bundles as necessary.
If you're in a checkout of the `apache/arrow` git repository
- and want to build the C++ library from the local source, make this `false`.
+ and want to build the C++ library from the local source, make this `false` or
+ not set. If building the C++ library from source with cmake unavailable, cmake
Review comment:
```suggestion
If building the C++ library from source with cmake unavailable, cmake
```
##########
File path: r/tools/nixlibs.R
##########
@@ -29,17 +29,8 @@ if (getRversion() < 3.4 && is.null(getOption("download.file.method"))) {
options(.arrow.cleanup = character()) # To collect dirs to rm on exit
on.exit(unlink(getOption(".arrow.cleanup")))
+
Review comment:
```suggestion
```
##########
File path: r/tools/nixlibs.R
##########
@@ -320,33 +300,54 @@ build_libarrow <- function(src_dir, dst_dir) {
BUILD_DIR = build_dir,
DEST_DIR = dst_dir,
CMAKE = cmake,
+ # EXTRA_CMAKE_FLAGS will often be "", but it's convenient later to have it defined
Review comment:
Why?
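
One possible reason, assuming `env_var_list` is a named character vector (the hunk does not show how it is constructed): `turn_off_thirdparty_features()` later reads `env_var_list[["EXTRA_CMAKE_FLAGS"]]`, and `[[` on an atomic vector raises an error for a missing name instead of returning `NULL`. The sketch below illustrates that behavior with made-up values.

```r
# Made-up values for illustration; the real vector is built in build_libarrow()
with_flags <- c(CMAKE = "cmake", EXTRA_CMAKE_FLAGS = "")
without_flags <- c(CMAKE = "cmake")

# Defined (even if empty): paste() appends the new flags after a space
paste(with_flags[["EXTRA_CMAKE_FLAGS"]], "-DARROW_SIMD_LEVEL=NONE")

# Not defined: `[[` on a named atomic vector errors ("subscript out of bounds")
result <- tryCatch(
  without_flags[["EXTRA_CMAKE_FLAGS"]],
  error = function(e) "error"
)
result
```

If `env_var_list` were a list instead, the missing element would come back as `NULL` and the definition would not be strictly required.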
##########
File path: r/vignettes/install.Rmd
##########
@@ -102,6 +102,14 @@ satisfy C++ dependencies.
> Note that, unlike packages like `tensorflow`, `blogdown`, and others that
> require external dependencies, you do not need to run `install_arrow()`
> after a successful `arrow` installation.
+The `install-arrow.R` file also includes the `download_optional_dependencies()`
+function. Normally, when installing on a computer with internet access, the
+build process will download third-party dependencies as needed. This function
+provides a way to download them in advance. Relevant environment variables are
Review comment:
These sentences should probably mention the offline/airgapped server use
case and how you'd use it.
##########
File path: r/tools/nixlibs.R
##########
@@ -413,66 +422,144 @@ cmake_version <- function(cmd = "cmake") {
)
}
-with_s3_support <- function(env_vars) {
- arrow_s3 <- toupper(Sys.getenv("ARROW_S3")) == "ON" || tolower(Sys.getenv("LIBARROW_MINIMAL")) == "false"
+turn_off_thirdparty_features <- function(env_var_list) {
+ # Because these are done as environment variables (as opposed to build flags),
+ # setting these to "OFF" overrides any previous setting. We don't need to
+ # check the existing value.
+ turn_off <- c(
+ "ARROW_MIMALLOC" = "OFF",
+ "ARROW_JEMALLOC" = "OFF",
+ "ARROW_PARQUET" = "OFF", # depends on thrift
+ "ARROW_DATASET" = "OFF", # depends on parquet
+ "ARROW_S3" = "OFF",
+ "ARROW_WITH_BROTLI" = "OFF",
+ "ARROW_WITH_BZ2" = "OFF",
+ "ARROW_WITH_LZ4" = "OFF",
+ "ARROW_WITH_SNAPPY" = "OFF",
+ "ARROW_WITH_ZLIB" = "OFF",
+ "ARROW_WITH_ZSTD" = "OFF",
+ "ARROW_WITH_RE2" = "OFF",
+ "ARROW_WITH_UTF8PROC" = "OFF",
+ # NOTE: this code sets the environment variable ARROW_JSON to "OFF", but
+ # that setting is will *not* be honored by build_arrow_static.sh until
+ # ARROW-13768 is resolved.
+ "ARROW_JSON" = "OFF",
+ # The syntax to turn off XSIMD is different.
+ # Pull existing value of EXTRA_CMAKE_FLAGS first (must be defined)
+ "EXTRA_CMAKE_FLAGS" = paste(
+ env_var_list[["EXTRA_CMAKE_FLAGS"]],
+ "-DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE"
+ )
+ )
+ # Create a new env_var_list, with the values of turn_off set.
+ # replace() also adds new values if they didn't exist before
+ replace(env_var_list, names(turn_off), turn_off)
+}
+
+set_thirdparty_urls <- function(env_var_list) {
+ # This function is run in most typical cases -- when download_ok is TRUE *or*
+ # ARROW_THIRDPARTY_DEPENDENCY_DIR is set. It does *not* check if existing
+ # *_SOURCE_URL variables are set. (It is also run whenever ARROW_DEPENDENCY_SOURCE
+ # is "SYSTEM", but doesn't affect the build in that case.)
+ deps_dir <- Sys.getenv("ARROW_THIRDPARTY_DEPENDENCY_DIR")
+ if (deps_dir == "") {
+ return(env_var_list)
+ }
+ files <- list.files(deps_dir, full.names = FALSE)
+ if (length(files) == 0) {
+ # This will be true if the directory doesn't exist, or if it exists but is empty.
+ # Here the build will continue, but will likely fail when the downloads are
+ # unavailable. The user will end up with the arrow-without-arrow package.
+ cat(paste0(
+ "*** Error: ARROW_THIRDPARTY_DEPENDENCY_DIR was set but has no files.\n",
Review comment:
```suggestion
"*** Warning: ARROW_THIRDPARTY_DEPENDENCY_DIR was set but has no files.\n",
```
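Separately from the wording change, the `replace()` call that closes `turn_off_thirdparty_features()` in the hunk above relies on name-based assignment both overwriting existing entries and appending missing ones. A small sketch of that behavior, using made-up starting values and assuming `env_var_list` is a named character vector:

```r
# Made-up starting values; names mirror the quoted hunk
env_var_list <- c(ARROW_S3 = "ON", CMAKE = "cmake")
turn_off <- c(ARROW_S3 = "OFF", ARROW_JSON = "OFF")

# replace() overwrites ARROW_S3 and appends ARROW_JSON, which was absent
updated <- replace(env_var_list, names(turn_off), turn_off)
updated
```

Entries not named in `turn_off` (here `CMAKE`) are left untouched, which is why the function does not need to inspect the existing values first.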
##########
File path: r/tools/nixlibs.R
##########
@@ -52,6 +43,24 @@ try_download <- function(from_url, to_file) {
!inherits(status, "try-error") && status == 0
}
+build_ok <- !env_is("LIBARROW_BUILD", "false")
+# But binary defaults to not OK
+binary_ok <- !identical(tolower(Sys.getenv("LIBARROW_BINARY", "false")), "false")
+# For local debugging, set ARROW_R_DEV=TRUE to make this script print more
+
+quietly <- !env_is("ARROW_R_DEV", "true") # try_download uses quietly global
+# * download_ok, build_ok: Use prebuilt binary, if found, otherwise try to build
+# * !download_ok, build_ok: Build with local git checkout, if available, or
+# sources included in r/tools/cpp/. Optional dependencies are not included,
+# and will not be automatically downloaded.
+# cmake will still be downloaded if necessary
+# https://arrow.apache.org/docs/developers/cpp/building.html#offline-builds
+# * download_ok, !build_ok: Only use prebuilt binary, if found
+# * neither: Get the arrow-without-arrow package
+# Download and build are OK unless you say not to (or can't access github)
+download_ok <- !env_is("TEST_OFFLINE_BUILD", "true") && try_download("https://github.com", tempfile())
+
+
Review comment:
```suggestion
# For local debugging, set ARROW_R_DEV=TRUE to make this script print more
quietly <- !env_is("ARROW_R_DEV", "true")
# Default is build from source, not download a binary
build_ok <- !env_is("LIBARROW_BUILD", "false")
binary_ok <- !identical(tolower(Sys.getenv("LIBARROW_BINARY", "false")), "false")
# Check if we're doing an offline build.
# (Note that cmake will still be downloaded if necessary
# https://arrow.apache.org/docs/developers/cpp/building.html#offline-builds)
download_ok <- !env_is("TEST_OFFLINE_BUILD", "true") && try_download("https://github.com", tempfile())
```
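The suggestion drops the four-case comment block from the original hunk. For reference, the combinations it described can be summarized as in the sketch below. Both `build_mode()` and the body of `env_is()` are assumptions for illustration; the real `env_is()` is defined elsewhere in `nixlibs.R` and may differ.

```r
# Assumed definition of the env_is() helper used in the hunk
env_is <- function(var, value) identical(tolower(Sys.getenv(var)), value)

# Hypothetical summary of the four cases listed in the original comment block
build_mode <- function(download_ok, build_ok) {
  if (download_ok && build_ok) {
    "use prebuilt binary if found, otherwise build"
  } else if (build_ok) {
    "build from local sources without optional dependencies"
  } else if (download_ok) {
    "only use prebuilt binary if found"
  } else {
    "arrow-without-arrow package"
  }
}

Sys.setenv(LIBARROW_BUILD = "FALSE")
# Under the assumed env_is(), matching is case-insensitive, so build_ok is FALSE
build_ok <- !env_is("LIBARROW_BUILD", "false")
build_mode(download_ok = TRUE, build_ok = build_ok)
```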
##########
File path: r/vignettes/install.Rmd
##########
@@ -343,6 +357,7 @@ By default, these are all unset. All boolean variables are case-insensitive.
* `CMAKE`: When building the C++ library from source, you can specify a
`/path/to/cmake` to use a different version than whatever is found on the
`$PATH`
+
Review comment:
```suggestion
```
##########
File path: r/R/install-arrow.R
##########
@@ -137,3 +136,68 @@ reload_arrow <- function() {
message("Please restart R to use the 'arrow' package.")
}
}
+
+
+#' Download all optional Arrow dependencies
+#'
+#' @param deps_dir Directory to save files into. Will be created if necessary.
+#' Defaults to the value of `ARROW_THIRDPARTY_DEPENDENCY_DIR`, if that
+#' environment variable is set.
+#'
+#' @return `deps_dir`, invisibly
+#'
+#' This function is used for setting up an offline build. If it's possible to
+#' download at build time, don't use this function. Instead, let `cmake`
+#' download them for you.
+#' If the files already exist in `deps_dir`, they will be re-downloaded and
+#' overwritten. Do not put other files in this directory.
+#' These saved files are only used in the build if `ARROW_DEPENDENCY_SOURCE`
+#' is `BUNDLED` or `AUTO`.
+#' https://arrow.apache.org/docs/developers/cpp/building.html#offline-builds
+#'
+#' ## Steps for an offline install with optional dependencies:
+#'
+#' ### On a computer with internet access:
+#' - Install the `arrow` package
Review comment:
Oh, I guess you're also relying on the package installation to deliver
the download_dependencies.sh and versions.txt scripts?
##########
File path: r/R/install-arrow.R
##########
@@ -137,3 +136,68 @@ reload_arrow <- function() {
message("Please restart R to use the 'arrow' package.")
}
}
+
+
+#' Download all optional Arrow dependencies
+#'
+#' @param deps_dir Directory to save files into. Will be created if necessary.
+#' Defaults to the value of `ARROW_THIRDPARTY_DEPENDENCY_DIR`, if that
+#' environment variable is set.
+#'
+#' @return `deps_dir`, invisibly
+#'
+#' This function is used for setting up an offline build. If it's possible to
+#' download at build time, don't use this function. Instead, let `cmake`
+#' download them for you.
+#' If the files already exist in `deps_dir`, they will be re-downloaded and
+#' overwritten. Do not put other files in this directory.
+#' These saved files are only used in the build if `ARROW_DEPENDENCY_SOURCE`
+#' is `BUNDLED` or `AUTO`.
+#' https://arrow.apache.org/docs/developers/cpp/building.html#offline-builds
+#'
+#' ## Steps for an offline install with optional dependencies:
+#'
+#' ### On a computer with internet access:
+#' - Install the `arrow` package
Review comment:
Yeah that makes sense. I was hoping to avoid the sound of "to install
arrow, first install arrow".
##########
File path: r/vignettes/install.Rmd
##########
@@ -285,17 +309,28 @@ setting `ARROW_WITH_ZSTD=OFF` to build without `zstd`; or (3) uninstalling
the conflicting `zstd`.
See discussion [here](https://issues.apache.org/jira/browse/ARROW-8556).
+* Offline installation fails when dependencies haven't been downloaded to
+`ARROW_THIRDPARTY_DEPENDENCY_DIR`. The package currently depends on the
+third-party project RapidJSON. See `?download_optional_dependencies`.
+See discussion [here](https://issues.apache.org/jira/browse/ARROW-13768) on
Review comment:
We should just solve this rather than document the exception, IMO
##########
File path: r/vignettes/install.Rmd
##########
@@ -342,6 +373,15 @@ By default, these are all unset. All boolean variables are case-insensitive.
The directory will be created if it does not exist.
* `CMAKE`: When building the C++ library from source, you can specify a
`/path/to/cmake` to use a different version than whatever is found on the
`$PATH`
+* `ARROW_THIRDPARTY_DEPENDENCY_DIR`: Directory with downloaded third-party
+ dependency files. Run `download_optional_dependencies(my-dir)` to download.
+* `TEST_OFFLINE_BUILD`: When set to `true`, the build script will not download
Review comment:
A better place for this would be the developing.Rmd vignette. We have
another env var, TEST_R_WITHOUT_LIBARROW, that could also be documented
there; like this one, it's not something a package user would ever want to set.