jllipatz opened a new issue, #34820:
URL: https://github.com/apache/arrow/issues/34820

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Hello
   
   I am using the arrow R package, version 11.0.0.3.
   
   I work with a large Parquet file (around 55 GB when loaded in memory). I want to build another Parquet file with an additional column computed from a join with a small table. I have several solutions with duckdb, and I am trying to build one using arrow alone. The following code leads to abnormal memory consumption (76 GB) when it starts the block after the call to `as_record_batch_reader()`, as if the reader needs the whole result to be in memory.
   That is not the case with the duckdb solution, for which the memory used varies during the process but never goes above 17 GB.
   The durations of the two versions are very similar.
    
   Additionally, the chunk size used by the arrow version is very small; is there a way to improve the writing of the Parquet file?
   
   
   ```r
   library(tictoc)
   library(arrow)
   library(dplyr)
   
   dep <- rio::import('V:/PALETTES/IGoR/data/dep2014.dbf')
   
   ds <- open_dataset('V:/PALETTES/parquet/rp68a19.parquet')
   tic()
   
   reader <- ds %>%
     left_join(dep,by=c("DR"="DEP")) %>%
     as_record_batch_reader()
     
   file <- FileOutputStream$create('V:/PALETTES/tmp/rp68a19c2.parquet')
   batch <- reader$read_next_batch()
   if (!is.null(batch)) {
     s <- batch$schema
     writer <- ParquetFileWriter$create(s,file,
            properties = ParquetWriterProperties$create(names(s)))
   
     i <- 0
     while (!is.null(batch)) {
       i <- i+1
       message(sprintf("%d, %d rows",i,nrow(batch)))
       writer$WriteTable(arrow_table(batch),chunk_size=1e6)
       batch <- reader$read_next_batch()
     }
     writer$Close()
   }
   file$close()
   toc()
   ```
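
   I wonder whether handing the query directly to `write_dataset()` would avoid materializing the join result, since it pulls batches from the scanner as it writes. A minimal sketch of what I have in mind, assuming the same paths as above and assuming that `max_rows_per_group` is the right knob for the row-group size:

   ```r
   library(arrow)
   library(dplyr)

   dep <- rio::import('V:/PALETTES/IGoR/data/dep2014.dbf')
   ds  <- open_dataset('V:/PALETTES/parquet/rp68a19.parquet')

   # write_dataset() consumes the query batch by batch instead of
   # collecting the whole join result into memory first;
   # max_rows_per_group is my assumption for controlling chunk size
   ds %>%
     left_join(dep, by = c("DR" = "DEP")) %>%
     write_dataset('V:/PALETTES/tmp/', format = "parquet",
                   max_rows_per_group = 1e6)
   ```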
   
   The code with duckdb:
   ```r
   library(DBI)
   library(arrow)
   library(duckdb)
   library(tictoc)
   con <- dbConnect(duckdb::duckdb())
   
   tic()
   reader <- duckdb_fetch_record_batch(
     dbSendQuery(con," 
       SELECT a.*,b.REGION
       FROM 'V:/PALETTES/parquet/rp68a19.parquet' a
       LEFT JOIN 'V:/PALETTES/SQL/data/dep2014.parquet' b
       ON a.DR=b.DEP
     ", arrow=TRUE))
   
   file <- FileOutputStream$create('V:/PALETTES/tmp/rp68a19d.parquet')
   batch <- reader$read_next_batch()
   if (!is.null(batch)) {
     s <- batch$schema
     writer <- ParquetFileWriter$create(s,file,
            properties = ParquetWriterProperties$create(names(s)))
   
     i <- 0
     while (!is.null(batch)) {
       i <- i+1
       message(sprintf("%d, %d rows",i,nrow(batch)))
       writer$WriteTable(arrow_table(batch),chunk_size=1e6)
       batch <- reader$read_next_batch()
     }
   
     writer$Close()
   }
   file$close()
   toc()
   ```
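
   For comparison, duckdb can also write the Parquet file itself with `COPY ... TO`, which keeps the whole pipeline streaming inside duckdb; a minimal sketch (the `ROW_GROUP_SIZE` option is my assumption for controlling the chunk size):

   ```r
   library(DBI)
   library(duckdb)

   con <- dbConnect(duckdb::duckdb())

   # COPY streams the join result straight into the Parquet file,
   # so no record batch reader is needed on the R side
   dbExecute(con, "
     COPY (
       SELECT a.*, b.REGION
       FROM 'V:/PALETTES/parquet/rp68a19.parquet' a
       LEFT JOIN 'V:/PALETTES/SQL/data/dep2014.parquet' b
       ON a.DR = b.DEP
     ) TO 'V:/PALETTES/tmp/rp68a19d.parquet'
       (FORMAT PARQUET, ROW_GROUP_SIZE 1000000)
   ")

   dbDisconnect(con, shutdown = TRUE)
   ```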
   
   ### Component(s)
   
   Parquet, R

