jllipatz opened a new issue, #34820:
URL: https://github.com/apache/arrow/issues/34820
### Describe the usage question you have. Please include as many useful
details as possible.
Hello,

I am using the arrow R package, version 11.0.0.3.

I work with a large Parquet file (around 55 GB when loaded in memory). I want to build another Parquet file with an additional column computed from a join with a small table. I have several working solutions with duckdb, and I am trying to build one using arrow alone. The following code shows abnormal memory consumption (76 GB) as soon as it enters the block after the call to `as_record_batch_reader()`, as if the reader needed the whole result to be in memory. That is not the case with the duckdb solution, whose memory use varies during the process but never goes above 17 GB. The two versions take very similar amounts of time.

Additionally, the chunk size used by the arrow version is very small; is there a way to improve how the Parquet file is written?
```r
library(tictoc)
library(arrow)
library(dplyr)

dep <- rio::import('V:/PALETTES/IGoR/data/dep2014.dbf')
ds  <- open_dataset('V:/PALETTES/parquet/rp68a19.parquet')

tic()
reader <- ds %>%
  left_join(dep, by = c("DR" = "DEP")) %>%
  as_record_batch_reader()
file <- FileOutputStream$create('V:/PALETTES/tmp/rp68a19c2.parquet')
batch <- reader$read_next_batch()
if (!is.null(batch)) {
  s <- batch$schema
  writer <- ParquetFileWriter$create(s, file,
    properties = ParquetWriterProperties$create(names(s)))
  i <- 0
  while (!is.null(batch)) {
    i <- i + 1
    message(sprintf("%d, %d rows", i, nrow(batch)))
    writer$WriteTable(arrow_table(batch), chunk_size = 1e6)
    batch <- reader$read_next_batch()
  }
  writer$Close()
}
file$close()
toc()
```
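For reference, an alternative I considered (an untested sketch, using the same paths as above): `arrow::write_dataset()` accepts a dplyr query and writes it out batch by batch, which might sidestep the manual reader/writer loop entirely; its `max_rows_per_group` argument should also control the row-group size directly.

```r
library(arrow)
library(dplyr)

dep <- rio::import('V:/PALETTES/IGoR/data/dep2014.dbf')
ds  <- open_dataset('V:/PALETTES/parquet/rp68a19.parquet')

ds %>%
  left_join(dep, by = c("DR" = "DEP")) %>%
  # writes a directory of Parquet files rather than a single file
  write_dataset('V:/PALETTES/tmp/rp68a19c2',
                format = "parquet",
                max_rows_per_group = 1e6)
```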
The code with duckdb:
```r
library(DBI)
library(arrow)
library(duckdb)
library(tictoc)

con <- dbConnect(duckdb::duckdb())

tic()
reader <- duckdb_fetch_record_batch(
  dbSendQuery(con, "
    SELECT a.*, b.REGION
    FROM 'V:/PALETTES/parquet/rp68a19.parquet' a
    LEFT JOIN 'V:/PALETTES/SQL/data/dep2014.parquet' b
      ON a.DR = b.DEP
  ", arrow = TRUE))
file <- FileOutputStream$create('V:/PALETTES/tmp/rp68a19d.parquet')
batch <- reader$read_next_batch()
if (!is.null(batch)) {
  s <- batch$schema
  writer <- ParquetFileWriter$create(s, file,
    properties = ParquetWriterProperties$create(names(s)))
  i <- 0
  while (!is.null(batch)) {
    i <- i + 1
    message(sprintf("%d, %d rows", i, nrow(batch)))
    writer$WriteTable(arrow_table(batch), chunk_size = 1e6)
    batch <- reader$read_next_batch()
  }
  writer$Close()
}
file$close()
toc()
```
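On the chunk-size point: since both loops call `WriteTable` once per batch, a row group can never be larger than one batch, however large `chunk_size` is. One hedged idea (an untested sketch; I am assuming batches can be recombined with `arrow_table()` per batch and `concat_tables()`) is to buffer batches until roughly `chunk_size` rows have accumulated before writing:

```r
# Untested sketch: buffer record batches until ~1e6 rows are collected,
# then write them as a single Table so row groups can reach chunk_size.
buf <- list()
n <- 0
batch <- reader$read_next_batch()
while (!is.null(batch)) {
  buf[[length(buf) + 1]] <- batch
  n <- n + batch$num_rows
  if (n >= 1e6) {
    tbl <- do.call(concat_tables, lapply(buf, arrow_table))
    writer$WriteTable(tbl, chunk_size = 1e6)
    buf <- list()
    n <- 0
  }
  batch <- reader$read_next_batch()
}
if (n > 0) {
  tbl <- do.call(concat_tables, lapply(buf, arrow_table))
  writer$WriteTable(tbl, chunk_size = 1e6)
}
```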
### Component(s)
Parquet, R
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]