[jira] [Created] (ARROW-11756) [R] passing a partition as a schema leads to segfaults

Jonathan Keane (Jira) Tue, 23 Feb 2021 15:00:06 -0800

Jonathan Keane created ARROW-11756:
--------------------------------------

             Summary: [R] passing a partition as a schema leads to segfaults
                 Key: ARROW-11756
                 URL: https://issues.apache.org/jira/browse/ARROW-11756
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
            Reporter: Jonathan Keane



[The command to open a dataset in 
R|https://arrow.apache.org/docs/r/reference/open_dataset.html] accepts both 
a schema and a partitioning argument. If a partitioning is accidentally passed 
as the schema, the call appears to succeed and the dataset looks like it was 
read, but subsequent operations on the dataset segfault.

Though this is an input error, we should add validation that checks that the 
schema argument is, in fact, a {{Schema}} object and raises an error if it is 
not, so that nobody is confronted with a segfault later.
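
A minimal sketch of what such a check could look like. The helper name {{check_schema_arg}} is hypothetical and not part of arrow's API; the sketch assumes that arrow's schema objects carry the R class {{Schema}} and that {{NULL}} (no schema supplied) should remain valid:

{code:r}
# Hypothetical helper (not arrow's actual implementation): fail fast with a
# readable error unless `schema` is NULL or inherits from the "Schema" class.
check_schema_arg <- function(schema) {
  if (!is.null(schema) && !inherits(schema, "Schema")) {
    stop(
      "`schema` must be a Schema object, not an object of class ",
      class(schema)[1],
      call. = FALSE
    )
  }
  # Return the input invisibly so the check can be used inline
  invisible(schema)
}

# Assumed usage: a partitioning passed by mistake would be rejected here
# instead of producing a malformed dataset.
# check_schema_arg(hive_partition(other = utf8(), group = uint8()))
{code}

A check like this would presumably run at the top of the dataset-opening code path, before the argument is handed to the C++ layer.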

{code:r}
### begin setup 
# note: this exact code is in test-dataset.R (lines 18-87), so when adding
# the test to that file you don't need to copy this; you can use the code at
# the bottom of this chunk in that test if you want.
library(arrow)
library(dplyr)

make_temp_dir <- function() {
  path <- tempfile()
  dir.create(path)
  normalizePath(path, winslash = "/")
}

hive_dir <- make_temp_dir()

first_date <- lubridate::ymd_hms("2015-04-29 03:12:39")
df1 <- tibble(
  int = 1:10,
  dbl = as.numeric(1:10),
  lgl = rep(c(TRUE, FALSE, NA, TRUE, FALSE), 2),
  chr = letters[1:10],
  fct = factor(LETTERS[1:10]),
  ts = first_date + lubridate::days(1:10)
)

second_date <- lubridate::ymd_hms("2017-03-09 07:01:02")
df2 <- tibble(
  int = 101:110,
  dbl = c(as.numeric(51:59), NaN),
  lgl = rep(c(TRUE, FALSE, NA, TRUE, FALSE), 2),
  chr = letters[10:1],
  fct = factor(LETTERS[10:1]),
  ts = second_date + lubridate::days(10:1)
)

dir.create(
  file.path(hive_dir, "subdir", "group=1", "other=xxx"),
  recursive = TRUE
)
dir.create(
  file.path(hive_dir, "subdir", "group=2", "other=yyy"),
  recursive = TRUE
)
write_parquet(
  df1,
  file.path(hive_dir, "subdir", "group=1", "other=xxx", "file1.parquet")
)
write_parquet(
  df2,
  file.path(hive_dir, "subdir", "group=2", "other=yyy", "file2.parquet")
)

### end setup

# This (the correct specification) works just fine
ds <- open_dataset(
  hive_dir,
  partitioning = hive_partition(other = utf8(), group = uint8())
)
ds$schema

# But if you aren't explicit with the argument names, it looks like everything
# works...
ds <- open_dataset(hive_dir, hive_partition(other = utf8(), group = uint8()))

# ...but the dataset is malformed and will segfault when you try to interact
# with it, for example:
ds$schema
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
