This is an automated email from the ASF dual-hosted git repository.

npr pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new 801de2fbcf GH-42143: [R] Sanitize R metadata (#41969)
801de2fbcf is described below

commit 801de2fbcf5bcbce0c019ed4b35ff3fc863b141b
Author: Neal Richardson <[email protected]>
AuthorDate: Fri Jun 14 16:09:34 2024 -0400

    GH-42143: [R] Sanitize R metadata (#41969)
    
    ### Rationale for this change
    
    `arrow` uses R `serialize()`/`unserialize()` to store additional
    metadata in the Arrow schema. This PR adds some extra checking and
    sanitizing in order to make the reading of this metadata robust to data
    of unknown provenance.
    
    ### What changes are included in this PR?
    
    * When writing metadata, we strip out all but simple types: strings,
    numbers, boolean, lists, etc. Objects of other types, such as
    environments, external pointers, and other language types, are removed.
    * When reading metadata, the same filter is applied. If there are types
    that are not in the allowlist, one of two things happens. By default,
    they are removed with a warning. If you set
    `options(arrow.unsafe_metadata = TRUE)`, the full metadata including
    disallowed types is returned, also with a warning. This option is an
    escape hatch in case we are too strict with dropping types when reading
    files produced by older versions of the package that did not filter them
    out.
    * `unserialize()` is called in a way that prevents promises contained in
    the data from being automatically invoked. This technique works on all
    versions of R: it is not dependent on the patch for RDS reading that was
    included in 4.4.
    * Other sanity checking to be stricter about only reading back in
    something of the form we wrote out: assert that the data is
    ASCII-serialized, and that, if it is compressed, it is gzip-compressed,
    the same way we do on serialization. It's not clear that this is
    necessary, but it's not bad to be extra strict here.
    
    ### Are these changes tested?
    
    Yes
    
    ### Are there any user-facing changes?
    
    For most, no. But:
    
    **This PR contains a "Critical Fix".**
    
    Without this patch, it is possible to construct an Arrow or Parquet file
    that contains code that would execute when the R metadata is
    applied during conversion to a data.frame. If you are using an older
    version of the package and are reading data from a source you do not
    trust, you can read into a `Table` and use its internal
    `$to_data_frame()` method, like `read_parquet(..., as_data_frame =
    FALSE)$to_data_frame()`. This should skip the reading of the R metadata.
    * GitHub Issue: #42143
---
 r/NEWS.md                        |   3 +-
 r/R/extension.R                  |   2 +-
 r/R/metadata.R                   | 109 +++++++++++++++++++++++++++++++++++----
 r/tests/testthat/test-metadata.R |  69 +++++++++++++++++++++++++
 4 files changed, 169 insertions(+), 14 deletions(-)

diff --git a/r/NEWS.md b/r/NEWS.md
index dc89fa266e..317e546a1b 100644
--- a/r/NEWS.md
+++ b/r/NEWS.md
@@ -19,11 +19,10 @@
 
 # arrow 16.1.0.9000
 
-# arrow 16.1.0
-
 * R functions that users write that use functions that Arrow supports in 
dataset queries now can be used in queries too. Previously, only functions that 
used arithmetic operators worked. For example, `time_hours <- function(mins) 
mins / 60` worked, but `time_hours_rounded <- function(mins) round(mins / 60)` 
did not; now both work. These are automatic translations rather than true 
user-defined functions (UDFs); for UDFs, see `register_scalar_function()`. 
(#41223)
 * `summarize()` supports more complex expressions, and correctly handles cases 
where column names are reused in expressions. 
 * The `na_matches` argument to the `dplyr::*_join()` functions is now 
supported. This argument controls whether `NA` values are considered equal when 
joining. (#41358)
+* R metadata, stored in the Arrow schema to support round-tripping data 
between R and Arrow/Parquet, is now serialized and deserialized more strictly. 
This makes it safer to load data from files from unknown sources into R 
data.frames. (#41969)
 
 # arrow 16.1.0
 
diff --git a/r/R/extension.R b/r/R/extension.R
index 59a02121fd..3529144e11 100644
--- a/r/R/extension.R
+++ b/r/R/extension.R
@@ -429,7 +429,7 @@ VctrsExtensionType <- R6Class("VctrsExtensionType",
       paste0(capture.output(print(self$ptype())), collapse = "\n")
     },
     deserialize_instance = function() {
-      private$.ptype <- unserialize(self$extension_metadata())
+      private$.ptype <- 
safe_r_metadata(safe_unserialize(self$extension_metadata()))
     },
     ExtensionEquals = function(other) {
       inherits(other, "VctrsExtensionType") && identical(self$ptype(), 
other$ptype())
diff --git a/r/R/metadata.R b/r/R/metadata.R
index 3ae2db4eaa..ba73f08578 100644
--- a/r/R/metadata.R
+++ b/r/R/metadata.R
@@ -30,7 +30,7 @@
     }
   }
 
-  out <- serialize(x, NULL, ascii = TRUE)
+  out <- serialize(safe_r_metadata(x, on_save = TRUE), NULL, ascii = TRUE)
 
   # if the metadata is over 100 kB, compress
   if (option_compress_metadata() && object.size(out) > 100000) {
@@ -44,23 +44,110 @@
 }
 
 .deserialize_arrow_r_metadata <- function(x) {
-  tryCatch(
-    expr = {
-      out <- unserialize(charToRaw(x))
-
-      # if this is still raw, try decompressing
-      if (is.raw(out)) {
-        out <- unserialize(memDecompress(out, type = "gzip"))
-      }
-      out
-    },
+  tryCatch(unserialize_r_metadata(x),
     error = function(e) {
+      if (getOption("arrow.debug", FALSE)) {
+        print(conditionMessage(e))
+      }
       warning("Invalid metadata$r", call. = FALSE)
       NULL
     }
   )
 }
 
+unserialize_r_metadata <- function(x) {
+  # Check that this is ASCII serialized data (as in, what we wrote)
+  if (!identical(substr(unclass(x), 1, 1), "A")) {
+    stop("Invalid serialized data")
+  }
+  out <- safe_unserialize(charToRaw(x))
+  # If it's still raw, decompress and unserialize again
+  if (is.raw(out)) {
+    decompressed <- memDecompress(out, type = "gzip")
+    if (!identical(rawToChar(decompressed[1]), "A")) {
+      stop("Invalid serialized compressed data")
+    }
+    out <- safe_unserialize(decompressed)
+  }
+  if (!is.list(out)) {
+    stop("Invalid serialized data: must be a list")
+  }
+  safe_r_metadata(out)
+}
+
+safe_unserialize <- function(x) {
+  # By capturing the data in a list, we can inspect it for promises without
+  # triggering their evaluation.
+  out <- list(unserialize(x))
+  if (typeof(out[[1]]) == "promise") {
+    stop("Serialized data contains a promise object")
+  }
+  out[[1]]
+}
+
+safe_r_metadata <- function(metadata, on_save = FALSE) {
+  # This function recurses through the metadata list and checks that all
+  # elements are of types that are allowed in R metadata.
+  # If it finds an element that is not allowed, it removes it.
+  #
+  # This function is used both when saving and loading metadata.
+  # @param on_save: If TRUE, the function will not warn if it removes elements:
+  # we're just cleaning up the metadata for saving. If FALSE, it means we're
+  # loading the metadata, and we'll warn if we find invalid elements.
+  #
+  # When loading metadata, you can optionally keep the invalid elements by
+  # setting `options(arrow.unsafe_metadata = TRUE)`. It will still check
+  # for invalid elements and warn if any are found, though.
+
+  # This variable will be used to store the types of elements that were 
removed,
+  # if any, so we can give an informative warning if needed.
+  types_removed <- c()
+
+  # Internal function that we'll recursively apply,
+  # and mutate the `types_removed` variable outside of it.
+  check_r_metadata_types_recursive <- function(x) {
+    allowed_types <- c("character", "double", "integer", "logical", "complex", 
"list", "NULL")
+    if (is.list(x)) {
+      types <- map_chr(x, typeof)
+      x[types == "list"] <- map(x[types == "list"], 
check_r_metadata_types_recursive)
+      ok <- types %in% allowed_types
+      if (!all(ok)) {
+        # Record the invalid types, then remove the offending elements
+        types_removed <<- c(types_removed, setdiff(types, allowed_types))
+        x <- x[ok]
+      }
+    }
+    x
+  }
+  new <- check_r_metadata_types_recursive(metadata)
+
+  # On save: don't warn, just save the filtered metadata
+  if (on_save) {
+    return(new)
+  }
+  # On load: warn if any elements were removed
+  if (length(types_removed)) {
+    types_msg <- paste("Type:", oxford_paste(unique(types_removed)))
+    if (getOption("arrow.unsafe_metadata", FALSE)) {
+      # We've opted-in to unsafe metadata, so warn but return the original 
metadata
+      rlang::warn(
+        "R metadata may have unsafe or invalid elements",
+        body = c("i" = types_msg)
+      )
+      new <- metadata
+    } else {
+      rlang::warn(
+        "Potentially unsafe or invalid elements have been discarded from R 
metadata.",
+        body = c(
+          "i" = types_msg,
+          ">" = "If you trust the source, you can set 
`options(arrow.unsafe_metadata = TRUE)` to preserve them."
+        )
+      )
+    }
+  }
+  new
+}
+
 #' @importFrom rlang trace_back
 apply_arrow_r_metadata <- function(x, r_metadata) {
   if (is.null(r_metadata)) {
diff --git a/r/tests/testthat/test-metadata.R b/r/tests/testthat/test-metadata.R
index e44cd71038..175e7ef3b6 100644
--- a/r/tests/testthat/test-metadata.R
+++ b/r/tests/testthat/test-metadata.R
@@ -107,6 +107,73 @@ test_that("Garbage R metadata doesn't break things", {
     "Invalid metadata$r",
     fixed = TRUE
   )
+
+  bad <- new.env(parent = emptyenv())
+  makeActiveBinding("columns", function() stop("This should not run"), bad)
+  tab$metadata <- list(r = rawToChar(serialize(bad, NULL, ascii = TRUE)))
+  expect_warning(
+    as.data.frame(tab),
+    "Invalid metadata$r",
+    fixed = TRUE
+  )
+
+  # https://hiddenlayer.com/research/r-bitrary-code-execution/
+  tab$metadata <- list(r = "A
+3
+262913
+197888
+5
+UTF-8
+5
+252
+6
+1
+262153
+7
+message
+2
+16
+1
+262153
+32
+arbitrary\040code\040was\040just\040executed
+254
+")
+  expect_message(
+    expect_warning(
+      as.data.frame(tab),
+      "Invalid metadata$r",
+      fixed = TRUE
+    ),
+    NA
+  )
+})
+
+test_that("Complex or unsafe attributes are pruned from R metadata, if they 
exist", {
+  tab <- Table$create(example_data[1:6])
+  bad <- new.env()
+  makeActiveBinding("class", function() stop("This should not run"), bad)
+  tab$metadata <- list(r = rawToChar(serialize(list(attributes = bad), NULL, 
ascii = TRUE)))
+  expect_warning(
+    as.data.frame(tab),
+    "Potentially unsafe or invalid elements have been discarded from R 
metadata.
+i Type: \"environment\"
+> If you trust the source, you can set `options(arrow.unsafe_metadata = TRUE)` 
to preserve them.",
+    fixed = TRUE
+  )
+  # You can set an option to allow them through.
+  # It still warns, just differently, and it doesn't prune the attributes
+  withr::local_options(list("arrow.unsafe_metadata" = TRUE))
+  expect_warning(
+    expect_warning(
+      as.data.frame(tab),
+      "R metadata may have unsafe or invalid elements
+i Type: \"environment\""
+    ),
+    # This particular example ultimately fails because it's not a list
+    "Invalid metadata$r",
+    fixed = TRUE
+  )
 })
 
 test_that("Metadata serialization compression", {
@@ -254,6 +321,8 @@ test_that("Row-level metadata (does not) roundtrip in 
datasets", {
   skip_if_not_available("dataset")
   skip_if_not_available("parquet")
 
+  library(dplyr, warn.conflicts = FALSE)
+
   df <- tibble::tibble(
     metadata = list(
       structure(1, my_value_as_attr = 1),

Reply via email to