jonkeane commented on a change in pull request #11001: URL: https://github.com/apache/arrow/pull/11001#discussion_r702120622
########## File path: r/R/install-arrow.R ########## @@ -137,3 +136,91 @@ reload_arrow <- function() { message("Please restart R to use the 'arrow' package.") } } + + +#' Create an install package with all thirdparty dependencies Review comment: ```suggestion #' Create a source bundle that includes all thirdparty dependencies ``` ########## File path: r/R/install-arrow.R ########## @@ -137,3 +136,91 @@ reload_arrow <- function() { message("Please restart R to use the 'arrow' package.") } } + + +#' Create an install package with all thirdparty dependencies +#' +#' @param dest_file File path for the new tar.gz package. Defaults to +#' `arrow_V.V.V_with_deps.tar.gz` in the current directory (`V.V.V` is the version) +#' @param source_file File path for the input tar.gz package. Defaults to +#' downloading the package. +#' @return The full path to `dest_file`, invisibly +#' +#' This function is used for setting up an offline build. If it's possible to +#' download at build time, don't use this function. Instead, let `cmake` +#' download the required dependencies for you. +#' These downloaded dependencies are only used in the build if +#' `ARROW_DEPENDENCY_SOURCE` is unset, `BUNDLED`, or `AUTO`. 
+#' https://arrow.apache.org/docs/developers/cpp/building.html#offline-builds +#' +#' ## Steps for an offline install with optional dependencies: +#' +#' ### Using a computer with internet access, pre-download the dependencies: +#' * Install the `arrow` package _or_ run +#' `source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")` +#' * Run `create_package_with_all_dependencies("my_arrow_pkg.tar.gz")` +#' * Copy the newly created `my_arrow_pkg.tar.gz` to the computer without internet access +#' +#' ### On the computer without internet access, install the prepared package: +#' * Install the `arrow` package from the copied file +#' * `install.packages("my_arrow_pkg.tar.gz", dependencies = c("Depends", "Imports", "LinkingTo"))` +#' * This installation will build from source, so `cmake` must be available +#' * Run [arrow_info()] to check installed capabilities +#' +#' +#' @examples +#' \dontrun{ +#' new_pkg <- create_package_with_all_dependencies() +#' # Note: this works when run in the same R session, but it's meant to be +#' # copied to a different computer. +#' install.packages(new_pkg, dependencies = c("Depends", "Imports", "LinkingTo")) +#' } +#' @export +create_package_with_all_dependencies <- function(dest_file = NULL, source_file = NULL) { + if (is.null(source_file)) { + pkg_download_dir <- tempfile() + dir.create(pkg_download_dir) + on.exit(unlink(pkg_download_dir, recursive = TRUE), add = TRUE) + downloaded <- utils::download.packages("arrow", destdir = pkg_download_dir, type = "source") Review comment: This is very minor, but do we want a message here saying that we are downloading the file? ########## File path: r/R/install-arrow.R ########## @@ -137,3 +136,91 @@ reload_arrow <- function() { message("Please restart R to use the 'arrow' package.") } } + + +#' Create an install package with all thirdparty dependencies +#' +#' @param dest_file File path for the new tar.gz package. 
Defaults to +#' `arrow_V.V.V_with_deps.tar.gz` in the current directory (`V.V.V` is the version) +#' @param source_file File path for the input tar.gz package. Defaults to +#' downloading the package. Review comment: ```suggestion #' @param source_file File path for the input tar.gz package. Defaults to #' downloading the package from CRAN (or whatever you have set as the first in `getOption("repos")`). ``` In adding this clarification, I realized that if someone has set RStudio Package Manager as their first repo, this might do funny things: they would be getting a binary, which should have *most* of everything built already, so the next steps would either be ignored or won't work. Maybe we "just" need to document that here and tell people who are doing that to use the binary they get from there. ########## File path: r/vignettes/install.Rmd ########## @@ -102,6 +102,37 @@ satisfy C++ dependencies. > Note that, unlike packages like `tensorflow`, `blogdown`, and others that > require external dependencies, you do not need to run `install_arrow()` > after a successful `arrow` installation. +The `install-arrow.R` file also includes the `create_package_with_all_dependencies()` +function. Normally, when installing on a computer with internet access, the +build process will download third-party dependencies as needed. +This function provides a way to download them in advance. +Doing so may be useful when installing Arrow on a computer without internet access. +Note that Arrow _can_ be installed on a computer without internet access, but Review comment: ```suggestion Note that Arrow _can_ be installed on a computer without internet access without doing this, but ``` ########## File path: r/R/install-arrow.R ########## @@ -137,3 +136,91 @@ reload_arrow <- function() { message("Please restart R to use the 'arrow' package.") } } + + +#' Create an install package with all thirdparty dependencies +#' +#' @param dest_file File path for the new tar.gz package. 
Defaults to +#' `arrow_V.V.V_with_deps.tar.gz` in the current directory (`V.V.V` is the version) +#' @param source_file File path for the input tar.gz package. Defaults to +#' downloading the package. +#' @return The full path to `dest_file`, invisibly +#' +#' This function is used for setting up an offline build. If it's possible to +#' download at build time, don't use this function. Instead, let `cmake` +#' download the required dependencies for you. +#' These downloaded dependencies are only used in the build if +#' `ARROW_DEPENDENCY_SOURCE` is unset, `BUNDLED`, or `AUTO`. +#' https://arrow.apache.org/docs/developers/cpp/building.html#offline-builds +#' +#' ## Steps for an offline install with optional dependencies: +#' +#' ### Using a computer with internet access, pre-download the dependencies: +#' * Install the `arrow` package _or_ run +#' `source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")` +#' * Run `create_package_with_all_dependencies("my_arrow_pkg.tar.gz")` +#' * Copy the newly created `my_arrow_pkg.tar.gz` to the computer without internet access +#' +#' ### On the computer without internet access, install the prepared package: +#' * Install the `arrow` package from the copied file +#' * `install.packages("my_arrow_pkg.tar.gz", dependencies = c("Depends", "Imports", "LinkingTo"))` +#' * This installation will build from source, so `cmake` must be available +#' * Run [arrow_info()] to check installed capabilities +#' +#' +#' @examples +#' \dontrun{ +#' new_pkg <- create_package_with_all_dependencies() +#' # Note: this works when run in the same R session, but it's meant to be +#' # copied to a different computer. +#' install.packages(new_pkg, dependencies = c("Depends", "Imports", "LinkingTo")) +#' } +#' @export +create_package_with_all_dependencies <- function(dest_file = NULL, source_file = NULL) { Review comment: I'm fine with the order these are in. 
Generally I like inputs before outputs like Neal mentioned, but you're right that for most people `source_file` will be left blank. ########## File path: r/tools/nixlibs.R ########## @@ -413,66 +423,129 @@ cmake_version <- function(cmd = "cmake") { ) } -with_s3_support <- function(env_vars) { - arrow_s3 <- toupper(Sys.getenv("ARROW_S3")) == "ON" || tolower(Sys.getenv("LIBARROW_MINIMAL")) == "false" +turn_off_thirdparty_features <- function(env_var_list) { + # Because these are done as environment variables (as opposed to build flags), + # setting these to "OFF" overrides any previous setting. We don't need to + # check the existing value. + turn_off <- c( + "ARROW_MIMALLOC" = "OFF", + "ARROW_JEMALLOC" = "OFF", + "ARROW_PARQUET" = "OFF", # depends on thrift + "ARROW_DATASET" = "OFF", # depends on parquet + "ARROW_S3" = "OFF", + "ARROW_WITH_BROTLI" = "OFF", + "ARROW_WITH_BZ2" = "OFF", + "ARROW_WITH_LZ4" = "OFF", + "ARROW_WITH_SNAPPY" = "OFF", + "ARROW_WITH_ZLIB" = "OFF", + "ARROW_WITH_ZSTD" = "OFF", + "ARROW_WITH_RE2" = "OFF", + "ARROW_WITH_UTF8PROC" = "OFF", + # NOTE: this code sets the environment variable ARROW_JSON to "OFF", but + # that setting is will *not* be honored by build_arrow_static.sh until + # ARROW-13768 is resolved. Review comment: ```suggestion ``` ARROW-13768 is resolved, so we can remove this, yeah? ########## File path: r/tools/nixlibs.R ########## @@ -413,66 +423,129 @@ cmake_version <- function(cmd = "cmake") { ) } -with_s3_support <- function(env_vars) { - arrow_s3 <- toupper(Sys.getenv("ARROW_S3")) == "ON" || tolower(Sys.getenv("LIBARROW_MINIMAL")) == "false" +turn_off_thirdparty_features <- function(env_var_list) { + # Because these are done as environment variables (as opposed to build flags), + # setting these to "OFF" overrides any previous setting. We don't need to + # check the existing value. 
+ turn_off <- c( + "ARROW_MIMALLOC" = "OFF", + "ARROW_JEMALLOC" = "OFF", + "ARROW_PARQUET" = "OFF", # depends on thrift + "ARROW_DATASET" = "OFF", # depends on parquet + "ARROW_S3" = "OFF", + "ARROW_WITH_BROTLI" = "OFF", + "ARROW_WITH_BZ2" = "OFF", + "ARROW_WITH_LZ4" = "OFF", + "ARROW_WITH_SNAPPY" = "OFF", + "ARROW_WITH_ZLIB" = "OFF", + "ARROW_WITH_ZSTD" = "OFF", + "ARROW_WITH_RE2" = "OFF", + "ARROW_WITH_UTF8PROC" = "OFF", + # NOTE: this code sets the environment variable ARROW_JSON to "OFF", but + # that setting is will *not* be honored by build_arrow_static.sh until + # ARROW-13768 is resolved. + "ARROW_JSON" = "OFF", + # The syntax to turn off XSIMD is different. + # Pull existing value of EXTRA_CMAKE_FLAGS first (must be defined) + "EXTRA_CMAKE_FLAGS" = paste( + env_var_list[["EXTRA_CMAKE_FLAGS"]], + "-DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE" + ) + ) + # Create a new env_var_list, with the values of turn_off set. + # replace() also adds new values if they didn't exist before + replace(env_var_list, names(turn_off), turn_off) +} + +set_thirdparty_urls <- function(env_var_list) { + # This function does *not* check if existing *_SOURCE_URL variables are set. + # The directory tools/thirdparty_dependencies is created by + # create_package_with_all_dependencies() and saved in the tar file. 
+ files <- list.files(thirdparty_dependency_dir, full.names = FALSE) + url_env_varname <- toupper(sub("(.*?)-.*", "ARROW_\\1_URL", files)) + # Special handling for the aws dependencies, which have extra `-` + aws <- grepl("^aws", files) + url_env_varname[aws] <- sub( + "AWS_SDK_CPP", "AWSSDK", + gsub( + "-", "_", + sub( + "(AWS.*)-.*", "ARROW_\\1_URL", + toupper(files[aws]) + ) + ) + ) + full_filenames <- file.path(normalizePath(thirdparty_dependency_dir), files) + + env_var_list <- replace(env_var_list, url_env_varname, full_filenames) + if (!quietly) { + env_var_list <- replace(env_var_list, "ARROW_VERBOSE_THIRDPARTY_BUILD", "ON") + } + env_var_list +} + +with_mimalloc <- function(env_var_list) { + arrow_mimalloc <- env_is("ARROW_MIMALLOC", "on") || env_is("LIBARROW_MINIMAL", "false") + if (arrow_mimalloc) { Review comment: ```suggestion # but if ARROW_MIMALLOC=OFF explicitly, we are definitely off, so override if (env_is("ARROW_MIMALLOC", "off")) { if (arrow_mimalloc) { ``` This wasn't in the original, but like S3 below it, we want to be able to set `LIBARROW_MINIMAL=FALSE ARROW_MIMALLOC=OFF` and have everything on but mimalloc off. While we're moving this code around, we might as well fix this too.
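
To make the requested override concrete, here is a minimal, self-contained sketch of the precedence being asked for: `LIBARROW_MINIMAL=false` turns mimalloc on by default, but an explicit `ARROW_MIMALLOC=OFF` wins. The `env_is()` defined here is only an assumed stand-in for the helper of the same name in `nixlibs.R`, and `mimalloc_enabled()` is a hypothetical name used for illustration; neither is taken from the PR.

```r
# Assumed stand-in for the env_is() helper referenced in the diff:
# TRUE when environment variable `var` equals `value`, ignoring case.
env_is <- function(var, value) {
  identical(tolower(Sys.getenv(var)), value)
}

# Hypothetical helper: decide whether mimalloc should be built.
mimalloc_enabled <- function() {
  enabled <- env_is("ARROW_MIMALLOC", "on") || env_is("LIBARROW_MINIMAL", "false")
  # An explicit ARROW_MIMALLOC=OFF overrides LIBARROW_MINIMAL=false.
  if (env_is("ARROW_MIMALLOC", "off")) {
    enabled <- FALSE
  }
  enabled
}

# Everything on, but mimalloc explicitly off.
Sys.setenv(LIBARROW_MINIMAL = "FALSE", ARROW_MIMALLOC = "OFF")
print(mimalloc_enabled())
```

The explicit "off" check runs after the default is computed, so it can only ever turn the feature off; it never re-enables something that was already disabled.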