Dewey Dunnington created ARROW-15405:
----------------------------------------
Summary: [R] Allow all file writers to write from a RecordBatchReader
Key: ARROW-15405
URL: https://issues.apache.org/jira/browse/ARROW-15405
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Dewey Dunnington
After ARROW-14741 we can use a RecordBatchReader as input to a CSV file writer.
It would be similarly helpful to be able to use a RecordBatchReader as input when
writing Parquet, IPC, and Feather files. RecordBatchReader support matters because
readers can be imported from C, Python, or database results (e.g., DuckDB) and may
represent streams that don't fit in memory.
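For context, the CSV writer already handles this after ARROW-14741. A minimal sketch
(the InMemoryDataset/Scanner here is just one way to produce a RecordBatchReader; a
reader imported from DuckDB or Python behaves the same way):

{code:R}
# Sketch: any RecordBatchReader works; an in-memory dataset is used here
# only to produce one without a database connection.
tab <- arrow::Table$create(data.frame(a = 1:10))
reader <- arrow::Scanner$create(arrow::InMemoryDataset$create(tab))$ToRecordBatchReader()

# Works after ARROW-14741: the reader is streamed to the CSV writer
arrow::write_csv_arrow(reader, tempfile(fileext = ".csv"))
{code}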
There is a workaround (at least for Parquet), but it's quite verbose:
{code:R}
tf <- tempfile(fileext = ".parquet")
df <- data.frame(a = 1:1e6)
arrow::write_parquet(df, tf)
dbdir <- ":memory:"
sink <- tempfile(fileext = ".parquet")
con <- DBI::dbConnect(duckdb::duckdb(dbdir = dbdir))
res <- DBI::dbSendQuery(con, glue::glue_sql("SELECT * FROM { tf }", .con = con), arrow = TRUE)
reader <- duckdb::duckdb_fetch_record_batch(res)
# write_parquet() doesn't *quite* support a record batch reader yet
# and we want to stream to the writer to support possibly
# bigger-than-memory results
fs <- arrow::LocalFileSystem$create()
stream <- fs$OpenOutputStream(sink)
chunk_size <- asNamespace("arrow")$calculate_chunk_size(1e6, 1)
batch <- arrow::Table$create(reader$read_next_batch())
writer <- arrow::ParquetFileWriter$create(
  reader$schema,
  stream,
  properties = arrow::ParquetWriterProperties$create(batch)
)
writer$WriteTable(batch, chunk_size = chunk_size)
while (!is.null(batch <- reader$read_next_batch())) {
  writer$WriteTable(arrow::Table$create(batch), chunk_size = chunk_size)
}
writer$Close()
stream$close()
DBI::dbDisconnect(con, shutdown = TRUE)
df2 <- dplyr::arrange(arrow::read_parquet(sink), a)
identical(df2$a, df$a)
#> [1] TRUE
{code}
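For comparison, what this issue asks for would reduce the workaround to roughly the
following (a hypothetical sketch reusing the reader and sink objects from the snippet
above; exact signatures are not final):

{code:R}
# Hypothetical: write_parquet() accepting a RecordBatchReader directly and
# streaming batches to the sink, analogous to write_csv_arrow() after
# ARROW-14741. The same would apply to write_feather() / write_ipc_stream().
arrow::write_parquet(reader, sink)
{code}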