paleolimbot commented on code in PR #34708:
URL: https://github.com/apache/arrow/pull/34708#discussion_r1147643721
##########
r/R/io.R:
##########
@@ -239,6 +239,14 @@ make_readable_file <- function(file, mmap = TRUE) {
path <- sub("/$", "", file$base_path)
file <- filesystem$OpenInputFile(path)
} else if (is.string(file)) {
+
+ # if this is a HTTP URL, we need a local copy to pass to
FileSystem$from_uri
+ if (is_http_url(file)) {
+ tf <- tempfile()
+ download.file(file, tf)
Review Comment:
```suggestion
download.file(file, tf, mode = 'wb')
```
For god knows what reason, `mode = 'wb'` is needed on Windows. Also be
careful with verbosity here (this might print to the console on POSIX or pop up
a window on Windows).
(Perhaps better to use Arrow's infrastructure to do this? We can already
create an `InputStream` so you could "just" read that into a file)
##########
r/R/io.R:
##########
@@ -239,6 +239,14 @@ make_readable_file <- function(file, mmap = TRUE) {
path <- sub("/$", "", file$base_path)
file <- filesystem$OpenInputFile(path)
} else if (is.string(file)) {
+
+ # if this is a HTTP URL, we need a local copy to pass to
FileSystem$from_uri
Review Comment:
This is only true for Parquet and the band formerly known as Feather: the
CSV and IPC stream readers can stream their data from a URL and this worked
already (as far as I know).
Below there's a note `# Try to create a RandomAccessFile first because some
readers need this`, and I wonder if a better solution is to pass an argument
`needs_random_access = TRUE`. In that case, the "make a temp file" workaround
can be invoked where needed.
##########
r/R/io.R:
##########
@@ -239,6 +239,14 @@ make_readable_file <- function(file, mmap = TRUE) {
path <- sub("/$", "", file$base_path)
file <- filesystem$OpenInputFile(path)
} else if (is.string(file)) {
+
+ # if this is a HTTP URL, we need a local copy to pass to
FileSystem$from_uri
+ if (is_http_url(file)) {
+ tf <- tempfile()
Review Comment:
This tempfile should be deleted when the read is over. You could do this at
the R level using `on.exit()` in `read_parquet()` but you'd need some way to
communicate what exactly needs deleting. Maybe the caller should have to
provide `temp_file` as an argument (which may or may not be used by
`make_readable_file()`.
##########
r/tests/testthat/test-parquet.R:
##########
@@ -472,3 +472,10 @@ test_that("Can read parquet with nested lists and maps", {
pq <- read_parquet(paste0(parquet_test_data, "/nested_maps.snappy.parquet"),
as_data_frame = FALSE)
expect_true(pq$a$type == map_of(utf8(), map_of(int32(), field("value",
boolean(), nullable = FALSE))))
})
+
+test_that("Can read Parquet files from a URL", {
Review Comment:
Maybe also pin the commit of this file and `skip_on_cran()`...we don't want
to get a CRAN ERROR because GitHub decides to go down for a day.
##########
r/tests/testthat/test-parquet.R:
##########
@@ -472,3 +472,10 @@ test_that("Can read parquet with nested lists and maps", {
pq <- read_parquet(paste0(parquet_test_data, "/nested_maps.snappy.parquet"),
as_data_frame = FALSE)
expect_true(pq$a$type == map_of(utf8(), map_of(int32(), field("value",
boolean(), nullable = FALSE))))
})
+
+test_that("Can read Parquet files from a URL", {
Review Comment:
Make sure that you `skip_if_offline()` here
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]