ryancasburn-KAI commented on issue #45038: URL: https://github.com/apache/arrow/issues/45038#issuecomment-2745266418
As @mapleFU pointed out, it seems like this one parameter (`max_open_files`) is being used for two purposes: keeping the total number of open file descriptors under a limit, and keeping memory usage reasonable. Perhaps a second parameter should be added, `max_open_rows`, so that it controls memory usage while `max_open_files` controls file descriptor use. If `max_open_rows` is reached, the file with the most buffered rows is written out, since flushing it drops the open row count (and therefore memory usage) the most. If `max_open_files` is reached, the least recently used file is written out, as today. This gives people the option to tune the performance as they need.

`max_open_rows` could default to `max_rows_per_file * max_open_files`. With this default, the row limit would effectively never be reached, so existing behavior is unchanged.

This would help with work I'm currently doing: I am setting a relatively low `max_rows_per_file` in order to limit memory usage, but a dataset-wide limit would let individual files potentially grow larger.
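To make the proposal concrete, here is a minimal Python sketch of the two-limit eviction policy described above. Nothing here is existing Arrow API: the class, the `max_open_rows` parameter, and the flush hooks are all hypothetical (a real implementation would live in the C++ dataset writer), and the limit is soft in the sense that a single batch larger than `max_open_rows` still has to be buffered.

```python
from collections import OrderedDict


class TwoLimitWriterSketch:
    """Hypothetical writer enforcing both proposed limits (not Arrow API)."""

    def __init__(self, max_open_files, max_rows_per_file, max_open_rows=None):
        self.max_open_files = max_open_files
        # Proposed default: max_rows_per_file * max_open_files, so the
        # row limit can never bind before the other two limits do.
        self.max_open_rows = (
            max_open_rows if max_open_rows is not None
            else max_rows_per_file * max_open_files
        )
        # OrderedDict doubles as an LRU queue: the least recently
        # written partition sits at the front.
        self.open_files = OrderedDict()  # partition key -> buffered row count
        self.total_open_rows = 0

    def write(self, partition_key, num_rows):
        # Row limit reached: flush the largest open file first, since
        # that drops the open row count (and memory) the most.
        while self.open_files and (
            self.total_open_rows + num_rows > self.max_open_rows
        ):
            self._flush_largest()
        if partition_key not in self.open_files:
            # File-descriptor limit reached: flush the least recently
            # used file, matching the writer's current behavior.
            if len(self.open_files) >= self.max_open_files:
                self._flush_lru()
            self.open_files[partition_key] = 0
        self.open_files[partition_key] += num_rows
        self.open_files.move_to_end(partition_key)  # mark as recently used
        self.total_open_rows += num_rows

    def _flush_largest(self):
        key = max(self.open_files, key=self.open_files.get)
        self._flush(key, self.open_files.pop(key), reason="largest")

    def _flush_lru(self):
        key, rows = self.open_files.popitem(last=False)
        self._flush(key, rows, reason="lru")

    def _flush(self, key, rows, reason):
        self.total_open_rows -= rows
        print(f"flush ({reason}): {key} ({rows} rows)")
```

For example, with `max_open_files=2` and `max_open_rows=100`, writing 60 rows to one partition, 30 to a second, then 50 more to the first flushes the 60-row file (the largest), leaving 80 rows buffered.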
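For reference, the workaround described in the last paragraph looks roughly like this with the existing `pyarrow.dataset.write_dataset` parameters (the table, path, and values are illustrative):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"part": ["a", "a", "b"], "value": [1, 2, 3]})

# Current workaround: keep max_rows_per_file low so that no single
# open file buffers too many rows, at the cost of more, smaller
# output files on disk.
ds.write_dataset(
    table,
    "/tmp/out",
    format="parquet",
    partitioning=["part"],
    max_open_files=512,         # existing file-descriptor knob
    max_rows_per_file=100_000,  # kept low purely to bound memory
)
```

Lowering `max_rows_per_file` bounds how many rows any single open file can buffer, but it also caps final file sizes; a separate `max_open_rows` would decouple the memory bound from the on-disk file size.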