torgebo commented on PR #8651: URL: https://github.com/apache/arrow-rs/pull/8651#issuecomment-3419767365
Hi, good point. It is true that the proposed change does not affect the row group size distribution of the output. So if you pass a "degenerate" dataset to `parquet-concat`, its output file will show those same degeneracies.

With greater power comes greater responsibility, but I don't see that as a strong argument against making our tools more powerful. Indeed, if there were _one true way_ of doing compute, you would likely not need a tool like `parquet-concat`.

The suggested change brings the behaviour of `parquet-concat` closer to that of the traditional Unix `cat` tool, by handling the files as a "stream".

On Linux, the open-file limit (`ulimit -n`) is as low as 1024, or even lower, on many systems. Many compute professionals (e.g. university professionals) use time-sharing systems where they might not have control over the system settings (or they might need to reserve file descriptors for other uses). It seems reasonable to let them concatenate their parquet files even so.

Let me know if you have additional concerns.
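To make the "stream" idea concrete, here is a minimal sketch using only the Rust standard library. It is not the code from this PR: `concat_streaming` is a hypothetical helper, and it copies raw bytes the way `cat` does, which would not by itself produce a valid Parquet file (the real tool also has to write new footer metadata). It only illustrates the descriptor-usage pattern: each input is opened when its turn comes and closed before the next one is opened, so the number of inputs is not bounded by the open-file limit.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper (not part of `parquet-concat`): append every input
/// to `out`, holding at most one input file descriptor open at a time.
/// Returns the total number of bytes copied.
fn concat_streaming<P: AsRef<Path>, W: Write>(inputs: &[P], out: &mut W) -> io::Result<u64> {
    let mut total = 0u64;
    for path in inputs {
        // Open the input only when it is its turn to be copied.
        let mut input = File::open(path)?;
        total += io::copy(&mut input, out)?;
        // `input` goes out of scope here, which closes its descriptor
        // before the next file is opened.
    }
    Ok(total)
}
```

By contrast, opening every input up front (for example, collecting all the `File` handles into a `Vec` before copying) needs one descriptor per input and can fail once the list is longer than the limit allows.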
