torgebo commented on PR #8651:
URL: https://github.com/apache/arrow-rs/pull/8651#issuecomment-3439337591

   Why should the user not be allowed to concatenate files into a larger file?
   
   First, we should assume the user _needs their data combined into a single 
file_, perhaps as part of integrating with an external system. They likely 
already have their data on disk. They might choose `parquet-concat` because it 
(a) copies the data correctly, (b) preserves the schema of the original 
dataset, and (c) is reasonably performant*.
   
   For the napkin calculation on row group contents:
   Take the valid boundary case of a 1-column dataset. Assume each file is 
128 MB and that there are 1000 files. The output would then be 128 GB, which 
is well within what we can generate with the new version of the tool on a 
single laptop. The row group size can be up towards 128 MB, which should not 
be too bad (optimally it should be larger, not smaller?). I would not call it 
"degenerate".
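   To make the arithmetic concrete, here is a minimal sketch of the same 
napkin math in Rust. The per-file size and file count are the assumptions 
stated above, not measured values:

   ```rust
   // Napkin math for concatenating many small Parquet files into one.
   // The inputs (128 MB per file, 1000 files) are assumptions from the
   // discussion above, not benchmarks.
   fn main() {
       const MB: u64 = 1_000_000;
       const GB: u64 = 1_000_000_000;

       let file_size: u64 = 128 * MB; // assumed size of each input file
       let num_files: u64 = 1000;     // assumed number of input files

       // Concatenation copies the data verbatim, so output size is the
       // sum of the inputs: 128 GB.
       let output_size = file_size * num_files;
       println!("output size: {} GB", output_size / GB);

       // Row groups are copied as-is, so the output row group size is
       // bounded by the input row group size: at most ~128 MB here
       // (one row group per input file in the boundary case).
       println!("max row group size: {} MB", file_size / MB);
   }
   ```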
   
   * Except for the limit on the number of open files.

