[ https://issues.apache.org/jira/browse/ARROW-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521132#comment-17521132 ]

Dewey Dunnington edited comment on ARROW-16144 at 4/12/22 1:05 PM:
-------------------------------------------------------------------

Thank you for catching my error here! I knew we did some compression 
detection, but it turns out it only happens on the read path: 
https://github.com/apache/arrow/blob/master/r/R/io.R#L240-L298
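
That read-side detection means reading a compressed file already "just 
works". A quick illustration (assuming a local data.csv.gz, such as the one 
written by the code below):

{code:R}
library(arrow, warn.conflicts = FALSE)

# On read, arrow detects the codec from the file extension and wraps
# the input in a CompressedInputStream for you
head(read_csv_arrow("data.csv.gz"))
{code}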

You can use {{OpenOutputStream}} and {{CompressedOutputStream}} with any 
filesystem (including S3), although we would need to implement filename-based 
compression detection for writes before this "just works" with the .gz suffix:

{code:R}
library(arrow, warn.conflicts = FALSE)

# Set up a scratch directory with a "bucket" subdirectory for MinIO to serve
dir <- tempfile()
dir.create(dir)
subdir <- file.path(dir, "bucket")
dir.create(subdir)

# Start a local MinIO server and give it a moment to come up
minio_server <- processx::process$new("minio", args = c("server", dir), supervise = TRUE)
Sys.sleep(1)
stopifnot(minio_server$is_alive())

# Connect with the default MinIO credentials
s3_uri <- "s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000"
bucket <- s3_bucket(s3_uri)

data <- data.frame(x = 1:1e4)

# Wrap the S3 output stream in a CompressedOutputStream (gzip by default)
out_compressed <- CompressedOutputStream$create(bucket$OpenOutputStream("bucket/data.csv.gz"))
write_csv_arrow(data, out_compressed)
out_compressed$close()

# ...and write the same data uncompressed for comparison
out <- bucket$OpenOutputStream("bucket/data.csv")
write_csv_arrow(data, out)
out$close()

# The compressed file is less than half the size
file.size(file.path(subdir, "data.csv.gz"))
#> [1] 22627
file.size(file.path(subdir, "data.csv"))
#> [1] 48898

# Shut down the MinIO server
minio_server$interrupt()
#> [1] TRUE
Sys.sleep(1)
stopifnot(!minio_server$is_alive())
{code}
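
For reference, here is a minimal sketch of what that filename-based detection 
could look like on the write side. Everything in it is hypothetical: neither 
{{detect_compression_from_path()}} nor {{open_output_stream_detected()}} 
exists in arrow today.

{code:R}
# Hypothetical sketch only: map a file extension to a codec name,
# mirroring what the read path already does
detect_compression_from_path <- function(path) {
  switch(tools::file_ext(path),
    "gz" = "gzip",
    "bz2" = "bz2",
    "zst" = "zstd",
    "lz4" = "lz4",
    "uncompressed"
  )
}

# Open an output stream on any filesystem, wrapping it in a
# CompressedOutputStream when the extension calls for one
open_output_stream_detected <- function(fs, path) {
  codec <- detect_compression_from_path(path)
  stream <- fs$OpenOutputStream(path)
  if (identical(codec, "uncompressed")) {
    stream
  } else {
    CompressedOutputStream$create(stream, codec = codec)
  }
}

# With something like this in place, the .gz suffix alone would be enough:
# out <- open_output_stream_detected(bucket, "bucket/data.csv.gz")
# write_csv_arrow(data, out)
# out$close()
{code}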



> [R] Write compressed data streams (particularly over S3)
> --------------------------------------------------------
>
>                 Key: ARROW-16144
>                 URL: https://issues.apache.org/jira/browse/ARROW-16144
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 7.0.0
>            Reporter: Carl Boettiger
>            Priority: Major
>
> The python bindings have `CompressedOutputStream`, but I don't see how we 
> can do this on the R side (e.g. with `write_csv_arrow()`). It would be 
> wonderful if we could both read and write compressed streams, particularly 
> for CSV and particularly for remote filesystems, where this can provide 
> considerable performance improvements.
> (For comparison, readr will write a compressed stream automatically based on 
> the extension of the given filename, e.g. `readr::write_csv(data, 
> "file.csv.gz")` or `readr::write_csv(data, "file.csv.xz")`.)


