[I] efficiently combine parquet files [arrow]

via GitHub Wed, 17 Jan 2024 12:56:29 -0800


r2evans opened a new issue, #39671:
URL: https://github.com/apache/arrow/issues/39671


   ### Describe the enhancement requested
   
   I recognize that appending to parquet files is not on the roadmap. Is it 
possible to efficiently concatenate two parquet files and write the result to a 
parquet file? Brute-force methods exist (read all of "A", read all of "B", and 
row-concatenate them however the language allows), but they require loading 
all data into memory. (I'm specifically targeting R, where it's perhaps more 
difficult to use the lower-level API.)
   
   One suggested alternative to the "append" request (such as 
https://github.com/apache/arrow/issues/32708) comes from 
https://github.com/apache/arrow/issues/32708#issuecomment-1378120110:
   
   > the pattern that Arrow enables is writing multiple files and then using 
open_dataset() to query them lazily
   
   This works fine in concept, but as the number of files increases, 
performance eventually suffers. The penalty can be mitigated (e.g., 
`unify_schemas=FALSE`), but at some point it becomes desirable to reduce the 
number of files by combining them. The brute-force read of both _works_, but it 
would be very nice to have a simple function that takes 1+ input filenames and 
1 output filename (previously non-existent) and concatenates the data as 
efficiently as possible (handling metadata, of course). I'm guessing there 
would need to be assumptions/requirements about the schemas of the input 
files. A first version might require them to be "effectively identical" (where 
"effectively" might allow differences such as `numeric`/`integer` or similar), 
but I'd still be very happy with "perfectly identical".
   
   I'm specifically targeting R in my usage, though I guess that other 
languages might also take advantage of this.
   
   Thanks!
   
   ### Component(s)
   
   Parquet, R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
