[GitHub] [arrow] paleolimbot commented on a diff in pull request #34708: GH-33287: [R] Cannot read_parquet on http URL

via GitHub Fri, 24 Mar 2023 07:33:14 -0700


paleolimbot commented on code in PR #34708:
URL: https://github.com/apache/arrow/pull/34708#discussion_r1147643721



##########
r/R/io.R:
##########
@@ -239,6 +239,14 @@ make_readable_file <- function(file, mmap = TRUE) {
     path <- sub("/$", "", file$base_path)
     file <- filesystem$OpenInputFile(path)
   } else if (is.string(file)) {
+
+    # if this is a HTTP URL, we need a local copy to pass to 
FileSystem$from_uri
+    if (is_http_url(file)) {
+        tf <- tempfile()
+        download.file(file, tf)

Review Comment:
   ```suggestion
           download.file(file, tf, mode = 'wb')
   ```
   
   For god knows what reason, `mode = 'wb'` is needed on Windows. Also be 
careful with verbosity here (this might print to the console on POSIX or pop up 
a window on Windows).
   
   (Perhaps better to use Arrow's infrastructure to do this? We can already 
create an `InputStream` so you could "just" read that into a file)



##########
r/R/io.R:
##########
@@ -239,6 +239,14 @@ make_readable_file <- function(file, mmap = TRUE) {
     path <- sub("/$", "", file$base_path)
     file <- filesystem$OpenInputFile(path)
   } else if (is.string(file)) {
+
+    # if this is a HTTP URL, we need a local copy to pass to 
FileSystem$from_uri

Review Comment:
   This is only true for Parquet and the band formerly known as Feather: the 
CSV and IPC stream readers can stream their data from a URL and this worked 
already (as far as I know).
   
   Below there's a note `# Try to create a RandomAccessFile first because some 
readers need this`, and I wonder if a better solution is to pass an argument 
`needs_random_access = TRUE`. In that case, the "make a temp file" workaround 
can be invoked where needed.



##########
r/R/io.R:
##########
@@ -239,6 +239,14 @@ make_readable_file <- function(file, mmap = TRUE) {
     path <- sub("/$", "", file$base_path)
     file <- filesystem$OpenInputFile(path)
   } else if (is.string(file)) {
+
+    # if this is a HTTP URL, we need a local copy to pass to 
FileSystem$from_uri
+    if (is_http_url(file)) {
+        tf <- tempfile()

Review Comment:
   This tempfile should be deleted when the read is over. You could do this at 
the R level using `on.exit()` in `read_parquet()` but you'd need some way to 
communicate what exactly needs deleting. Maybe the caller should have to 
provide `temp_file` as an argument (which may or may not be used by 
`make_readable_file()`.



##########
r/tests/testthat/test-parquet.R:
##########
@@ -472,3 +472,10 @@ test_that("Can read parquet with nested lists and maps", {
   pq <- read_parquet(paste0(parquet_test_data, "/nested_maps.snappy.parquet"), 
as_data_frame = FALSE)
   expect_true(pq$a$type == map_of(utf8(), map_of(int32(), field("value", 
boolean(), nullable = FALSE))))
 })
+
+test_that("Can read Parquet files from a URL", {

Review Comment:
   Maybe also pin the commit of this file and `skip_on_cran()`...we don't want 
to get a CRAN ERROR because GitHub decides to go down for a day.



##########
r/tests/testthat/test-parquet.R:
##########
@@ -472,3 +472,10 @@ test_that("Can read parquet with nested lists and maps", {
   pq <- read_parquet(paste0(parquet_test_data, "/nested_maps.snappy.parquet"), 
as_data_frame = FALSE)
   expect_true(pq$a$type == map_of(utf8(), map_of(int32(), field("value", 
boolean(), nullable = FALSE))))
 })
+
+test_that("Can read Parquet files from a URL", {

Review Comment:
   Make sure that you `skip_if_offline()` here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] paleolimbot commented on a diff in pull request #34708: GH-33287: [R] Cannot read_parquet on http URL

Reply via email to