jonkeane commented on a change in pull request #11001:
URL: https://github.com/apache/arrow/pull/11001#discussion_r702120622



##########
File path: r/R/install-arrow.R
##########
@@ -137,3 +136,91 @@ reload_arrow <- function() {
     message("Please restart R to use the 'arrow' package.")
   }
 }
+
+
+#' Create an install package with all thirdparty dependencies

Review comment:
       ```suggestion
   #' Create a source bundle that includes all thirdparty dependencies
   ```

##########
File path: r/R/install-arrow.R
##########
@@ -137,3 +136,91 @@ reload_arrow <- function() {
     message("Please restart R to use the 'arrow' package.")
   }
 }
+
+
+#' Create an install package with all thirdparty dependencies
+#'
+#' @param dest_file File path for the new tar.gz package. Defaults to
+#' `arrow_V.V.V_with_deps.tar.gz` in the current directory (`V.V.V` is the version)
+#' @param source_file File path for the input tar.gz package. Defaults to
+#' downloading the package.
+#' @return The full path to `dest_file`, invisibly
+#'
+#' This function is used for setting up an offline build. If it's possible to
+#' download at build time, don't use this function. Instead, let `cmake`
+#' download the required dependencies for you.
+#' These downloaded dependencies are only used in the build if
+#' `ARROW_DEPENDENCY_SOURCE` is unset, `BUNDLED`, or `AUTO`.
+#' https://arrow.apache.org/docs/developers/cpp/building.html#offline-builds
+#'
+#' ## Steps for an offline install with optional dependencies:
+#'
+#' ### Using a computer with internet access, pre-download the dependencies:
+#' * Install the `arrow` package _or_ run
+#'   `source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")`
+#' * Run `create_package_with_all_dependencies("my_arrow_pkg.tar.gz")`
+#' * Copy the newly created `my_arrow_pkg.tar.gz` to the computer without internet access
+#'
+#' ### On the computer without internet access, install the prepared package:
+#' * Install the `arrow` package from the copied file
+#'   * `install.packages("my_arrow_pkg.tar.gz", dependencies = c("Depends", "Imports", "LinkingTo"))`
+#'   * This installation will build from source, so `cmake` must be available
+#' * Run [arrow_info()] to check installed capabilities
+#'
+#'
+#' @examples
+#' \dontrun{
+#' new_pkg <- create_package_with_all_dependencies()
+#' # Note: this works when run in the same R session, but it's meant to be
+#' # copied to a different computer.
+#' install.packages(new_pkg, dependencies = c("Depends", "Imports", "LinkingTo"))
+#' }
+#' @export
+create_package_with_all_dependencies <- function(dest_file = NULL, source_file = NULL) {
+  if (is.null(source_file)) {
+    pkg_download_dir <- tempfile()
+    dir.create(pkg_download_dir)
+    on.exit(unlink(pkg_download_dir, recursive = TRUE), add = TRUE)
+    downloaded <- utils::download.packages("arrow", destdir = pkg_download_dir, type = "source")

Review comment:
       This is very minor, but do we want a message here saying that we are 
downloading the file?

##########
File path: r/R/install-arrow.R
##########
@@ -137,3 +136,91 @@ reload_arrow <- function() {
     message("Please restart R to use the 'arrow' package.")
   }
 }
+
+
+#' Create an install package with all thirdparty dependencies
+#'
+#' @param dest_file File path for the new tar.gz package. Defaults to
+#' `arrow_V.V.V_with_deps.tar.gz` in the current directory (`V.V.V` is the version)
+#' @param source_file File path for the input tar.gz package. Defaults to
+#' downloading the package.

Review comment:
       ```suggestion
   #' @param source_file File path for the input tar.gz package. Defaults to
   #' downloading the package from CRAN (or whatever you have set as the first
   #' repository in `getOption("repos")`).
   ```
   
   While adding this clarification, I realized that if someone has RStudio
   Package Manager set as their first repo, this might do funny things: they
   would get a binary that should already have *most* of everything built, so
   the next steps would either be ignored or fail. Maybe we "just" need to
   document that here and tell people in that situation to use the binary
   they get from there.
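   To make the concern concrete, here is a minimal sketch for checking which
   repository comes first. `first_repo()` is a hypothetical helper, not part
   of arrow; the commented-out call only illustrates that
   `utils::download.packages()` accepts an explicit `repos` argument, and the
   CRAN mirror URL in it is just an example.

```r
# Hypothetical helper (not part of arrow): return the first configured
# repository URL, or NA if no repositories are set.
first_repo <- function(repos = getOption("repos")) {
  if (length(repos) == 0) {
    return(NA_character_)
  }
  unname(repos[[1]])
}

# Show what the current session would consult first.
print(first_repo())

# To avoid depending on the user's repository order, the download could name
# the repository explicitly (illustrative only, requires network access):
# utils::download.packages(
#   "arrow",
#   destdir = tempdir(),
#   repos = "https://cloud.r-project.org",
#   type = "source"
# )
```

   Whether pinning the repository is preferable to documenting the caveat is
   a judgment call for the PR author.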

##########
File path: r/vignettes/install.Rmd
##########
@@ -102,6 +102,37 @@ satisfy C++ dependencies.
 
 > Note that, unlike packages like `tensorflow`, `blogdown`, and others that 
 > require external dependencies, you do not need to run `install_arrow()` 
 > after a successful `arrow` installation.
 
+The `install-arrow.R` file also includes the `create_package_with_all_dependencies()`
+function. Normally, when installing on a computer with internet access, the
+build process will download third-party dependencies as needed.
+This function provides a way to download them in advance.
+Doing so may be useful when installing Arrow on a computer without internet access.
+Note that Arrow _can_ be installed on a computer without internet access, but

Review comment:
       ```suggestion
   Note that Arrow _can_ be installed on a computer without internet access 
without doing this, but
   ```

##########
File path: r/R/install-arrow.R
##########
@@ -137,3 +136,91 @@ reload_arrow <- function() {
     message("Please restart R to use the 'arrow' package.")
   }
 }
+
+
+#' Create an install package with all thirdparty dependencies
+#'
+#' @param dest_file File path for the new tar.gz package. Defaults to
+#' `arrow_V.V.V_with_deps.tar.gz` in the current directory (`V.V.V` is the version)
+#' @param source_file File path for the input tar.gz package. Defaults to
+#' downloading the package.
+#' @return The full path to `dest_file`, invisibly
+#'
+#' This function is used for setting up an offline build. If it's possible to
+#' download at build time, don't use this function. Instead, let `cmake`
+#' download the required dependencies for you.
+#' These downloaded dependencies are only used in the build if
+#' `ARROW_DEPENDENCY_SOURCE` is unset, `BUNDLED`, or `AUTO`.
+#' https://arrow.apache.org/docs/developers/cpp/building.html#offline-builds
+#'
+#' ## Steps for an offline install with optional dependencies:
+#'
+#' ### Using a computer with internet access, pre-download the dependencies:
+#' * Install the `arrow` package _or_ run
+#'   `source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")`
+#' * Run `create_package_with_all_dependencies("my_arrow_pkg.tar.gz")`
+#' * Copy the newly created `my_arrow_pkg.tar.gz` to the computer without internet access
+#'
+#' ### On the computer without internet access, install the prepared package:
+#' * Install the `arrow` package from the copied file
+#'   * `install.packages("my_arrow_pkg.tar.gz", dependencies = c("Depends", "Imports", "LinkingTo"))`
+#'   * This installation will build from source, so `cmake` must be available
+#' * Run [arrow_info()] to check installed capabilities
+#'
+#'
+#' @examples
+#' \dontrun{
+#' new_pkg <- create_package_with_all_dependencies()
+#' # Note: this works when run in the same R session, but it's meant to be
+#' # copied to a different computer.
+#' install.packages(new_pkg, dependencies = c("Depends", "Imports", "LinkingTo"))
+#' }
+#' @export
+create_package_with_all_dependencies <- function(dest_file = NULL, source_file = NULL) {

Review comment:
       I'm fine with the order these are in. Generally I like inputs before 
outputs like Neal mentioned, but you're right that for most people 
`source_file` will be left blank.

##########
File path: r/tools/nixlibs.R
##########
@@ -413,66 +423,129 @@ cmake_version <- function(cmd = "cmake") {
   )
 }
 
-with_s3_support <- function(env_vars) {
-  arrow_s3 <- toupper(Sys.getenv("ARROW_S3")) == "ON" || tolower(Sys.getenv("LIBARROW_MINIMAL")) == "false"
+turn_off_thirdparty_features <- function(env_var_list) {
+  # Because these are done as environment variables (as opposed to build flags),
+  # setting these to "OFF" overrides any previous setting. We don't need to
+  # check the existing value.
+  turn_off <- c(
+    "ARROW_MIMALLOC" = "OFF",
+    "ARROW_JEMALLOC" = "OFF",
+    "ARROW_PARQUET" = "OFF", # depends on thrift
+    "ARROW_DATASET" = "OFF", # depends on parquet
+    "ARROW_S3" = "OFF",
+    "ARROW_WITH_BROTLI" = "OFF",
+    "ARROW_WITH_BZ2" = "OFF",
+    "ARROW_WITH_LZ4" = "OFF",
+    "ARROW_WITH_SNAPPY" = "OFF",
+    "ARROW_WITH_ZLIB" = "OFF",
+    "ARROW_WITH_ZSTD" = "OFF",
+    "ARROW_WITH_RE2" = "OFF",
+    "ARROW_WITH_UTF8PROC" = "OFF",
+    # NOTE: this code sets the environment variable ARROW_JSON to "OFF", but
+    # that setting will *not* be honored by build_arrow_static.sh until
+    # ARROW-13768 is resolved.

Review comment:
       ```suggestion
   ```
   
   ARROW-13768 is resolved, so we can remove this, yeah?

##########
File path: r/tools/nixlibs.R
##########
@@ -413,66 +423,129 @@ cmake_version <- function(cmd = "cmake") {
   )
 }
 
-with_s3_support <- function(env_vars) {
-  arrow_s3 <- toupper(Sys.getenv("ARROW_S3")) == "ON" || tolower(Sys.getenv("LIBARROW_MINIMAL")) == "false"
+turn_off_thirdparty_features <- function(env_var_list) {
+  # Because these are done as environment variables (as opposed to build flags),
+  # setting these to "OFF" overrides any previous setting. We don't need to
+  # check the existing value.
+  turn_off <- c(
+    "ARROW_MIMALLOC" = "OFF",
+    "ARROW_JEMALLOC" = "OFF",
+    "ARROW_PARQUET" = "OFF", # depends on thrift
+    "ARROW_DATASET" = "OFF", # depends on parquet
+    "ARROW_S3" = "OFF",
+    "ARROW_WITH_BROTLI" = "OFF",
+    "ARROW_WITH_BZ2" = "OFF",
+    "ARROW_WITH_LZ4" = "OFF",
+    "ARROW_WITH_SNAPPY" = "OFF",
+    "ARROW_WITH_ZLIB" = "OFF",
+    "ARROW_WITH_ZSTD" = "OFF",
+    "ARROW_WITH_RE2" = "OFF",
+    "ARROW_WITH_UTF8PROC" = "OFF",
+    # NOTE: this code sets the environment variable ARROW_JSON to "OFF", but
+    # that setting will *not* be honored by build_arrow_static.sh until
+    # ARROW-13768 is resolved.
+    "ARROW_JSON" = "OFF",
+    # The syntax to turn off XSIMD is different.
+    # Pull existing value of EXTRA_CMAKE_FLAGS first (must be defined)
+    "EXTRA_CMAKE_FLAGS" = paste(
+      env_var_list[["EXTRA_CMAKE_FLAGS"]],
+      "-DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE"
+    )
+  )
+  # Create a new env_var_list, with the values of turn_off set.
+  # replace() also adds new values if they didn't exist before
+  replace(env_var_list, names(turn_off), turn_off)
+}
+
+set_thirdparty_urls <- function(env_var_list) {
+  # This function does *not* check if existing *_SOURCE_URL variables are set.
+  # The directory tools/thirdparty_dependencies is created by
+  # create_package_with_all_dependencies() and saved in the tar file.
+  files <- list.files(thirdparty_dependency_dir, full.names = FALSE)
+  url_env_varname <- toupper(sub("(.*?)-.*", "ARROW_\\1_URL", files))
+  # Special handling for the aws dependencies, which have extra `-`
+  aws <- grepl("^aws", files)
+  url_env_varname[aws] <- sub(
+    "AWS_SDK_CPP", "AWSSDK",
+    gsub(
+      "-", "_",
+      sub(
+        "(AWS.*)-.*", "ARROW_\\1_URL",
+        toupper(files[aws])
+      )
+    )
+  )
+  full_filenames <- file.path(normalizePath(thirdparty_dependency_dir), files)
+
+  env_var_list <- replace(env_var_list, url_env_varname, full_filenames)
+  if (!quietly) {
+    env_var_list <- replace(env_var_list, "ARROW_VERBOSE_THIRDPARTY_BUILD", "ON")
+  }
+  env_var_list
+}
+
+with_mimalloc <- function(env_var_list) {
+  arrow_mimalloc <- env_is("ARROW_MIMALLOC", "on") || env_is("LIBARROW_MINIMAL", "false")
+  if (arrow_mimalloc) {

Review comment:
       ```suggestion
     # but if ARROW_MIMALLOC=OFF explicitly, we are definitely off, so override
     if (env_is("ARROW_MIMALLOC", "off")) {
       arrow_mimalloc <- FALSE
     }
     if (arrow_mimalloc) {
   ```
   
   This wasn't in the original, but like S3 below it, we want to be able to do
   `LIBARROW_MINIMAL=FALSE ARROW_MIMALLOC=OFF` and have everything on but
   mimalloc off. And while we're moving this code around, we might as well fix
   this too.



