Re: [PR] parquet-concat: handle large number of files. [arrow-rs]

via GitHub Sun, 19 Oct 2025 08:59:26 -0700


torgebo commented on PR #8651:
URL: https://github.com/apache/arrow-rs/pull/8651#issuecomment-3419767365


   Hi, good point.
   
   It is true that the proposed change does not affect the size distribution 
of the output row groups. So if you pass a "degenerate" dataset to 
`parquet-concat`, its output file will show those same degeneracies.
   
   It may be that with greater power comes greater responsibility, but I don't 
see that as a strong argument against making our tools more powerful. Indeed, 
if there were _one true way_ of doing compute, you would likely not need a tool 
like `parquet-concat`.
   
   The suggested change brings the behaviour of `parquet-concat` closer to that 
of the traditional Unix `cat` tool, by handling the input files as a "stream" 
rather than keeping them all open at once. On Linux, the open-file limit 
(`ulimit -n`) is as low as 1024, or even lower, on many systems. Many compute 
professionals (e.g. at universities) use time-sharing systems where they might 
not have control over the system settings (or they might need to reserve file 
descriptors for other uses). It seems reasonable to let them concatenate their 
parquet files even so.
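   
   To illustrate what I mean by "stream", here is a minimal sketch using only 
the Rust standard library. It is not the code from this PR and it is not valid 
for parquet files (a real concat has to append row groups and write a new 
footer); it is a plain byte-level `cat`. The function name `concat_streaming` 
is hypothetical. The point is only that each input is opened, consumed and 
closed before the next one is opened, so the number of descriptors in use stays 
constant no matter how many inputs there are.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Hypothetical helper: byte-level concatenation in the style of `cat`.
/// Returns the total number of bytes written to `output`.
pub fn concat_streaming<P: AsRef<Path>>(inputs: &[P], output: &Path) -> io::Result<u64> {
    let mut writer = BufWriter::new(File::create(output)?);
    let mut total = 0u64;
    for path in inputs {
        // Only one input file is open at any point in time.
        let mut reader = File::open(path)?;
        total += io::copy(&mut reader, &mut writer)?;
        // `reader` goes out of scope here, which closes its descriptor
        // before the next input is opened.
    }
    writer.flush()?;
    Ok(total)
}
```

   With this shape, at most two descriptors are held by the loop (one input 
plus the output), whereas opening every input up front needs one descriptor 
per file and can run into the `ulimit -n` ceiling.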
   
   Let me know if you have additional concerns.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
