ryancasburn-KAI commented on issue #45038: URL: https://github.com/apache/arrow/issues/45038#issuecomment-2745266418
As @mapleFU pointed out, it seems like this one parameter (`max_open_files`) is being used for two purposes: keeping the total number of open file descriptors under a limit, and keeping memory usage reasonable. Perhaps a second parameter should be added, `max_open_rows`, so that it controls memory usage while `max_open_files` controls file descriptor use. If `max_open_rows` is reached, the file with the most buffered rows is written out, since flushing it drops the open row count (and therefore memory usage) the most. If `max_open_files` is reached, the least recently used file is written out, as today. This gives people the option to tune the performance as they need.

`max_open_rows` could default to `max_rows_per_file * max_open_files`. With this default, the row limit would effectively never be reached, so existing behavior is unchanged.

This would help with work I'm currently doing: I am setting a relatively low `max_rows_per_file` in order to limit memory usage, but a dataset-wide limit would let individual files potentially grow larger.
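To make the proposal concrete, here is a minimal Python sketch of the two-limit eviction policy described above. Nothing here is existing Arrow API: the class, the `max_open_rows` parameter, and the flush hooks are all hypothetical (a real implementation would live in the C++ dataset writer), and the limit is soft in the sense that a single batch larger than `max_open_rows` still has to be buffered.

```python
from collections import OrderedDict


class TwoLimitWriterSketch:
    """Hypothetical writer enforcing both proposed limits (not Arrow API)."""

    def __init__(self, max_open_files, max_rows_per_file, max_open_rows=None):
        self.max_open_files = max_open_files
        # Proposed default: max_rows_per_file * max_open_files, so the
        # row limit can never bind before the other two limits do.
        self.max_open_rows = (
            max_open_rows if max_open_rows is not None
            else max_rows_per_file * max_open_files
        )
        # OrderedDict doubles as an LRU queue: the least recently
        # written partition sits at the front.
        self.open_files = OrderedDict()  # partition key -> buffered row count
        self.total_open_rows = 0

    def write(self, partition_key, num_rows):
        # Row limit reached: flush the largest open file first, since
        # that drops the open row count (and memory) the most.
        while self.open_files and (
            self.total_open_rows + num_rows > self.max_open_rows
        ):
            self._flush_largest()
        if partition_key not in self.open_files:
            # File-descriptor limit reached: flush the least recently
            # used file, matching the writer's current behavior.
            if len(self.open_files) >= self.max_open_files:
                self._flush_lru()
            self.open_files[partition_key] = 0
        self.open_files[partition_key] += num_rows
        self.open_files.move_to_end(partition_key)  # mark as recently used
        self.total_open_rows += num_rows

    def _flush_largest(self):
        key = max(self.open_files, key=self.open_files.get)
        self._flush(key, self.open_files.pop(key), reason="largest")

    def _flush_lru(self):
        key, rows = self.open_files.popitem(last=False)
        self._flush(key, rows, reason="lru")

    def _flush(self, key, rows, reason):
        self.total_open_rows -= rows
        print(f"flush ({reason}): {key} ({rows} rows)")
```

For example, with `max_open_files=2` and `max_open_rows=100`, writing 60 rows to one partition, 30 to a second, then 50 more to the first flushes the 60-row file (the largest), leaving 80 rows buffered.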
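For reference, the workaround described in the last paragraph looks roughly like this with the existing `pyarrow.dataset.write_dataset` parameters (the table, path, and values are illustrative):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"part": ["a", "a", "b"], "value": [1, 2, 3]})

# Current workaround: keep max_rows_per_file low so that no single
# open file buffers too many rows, at the cost of more, smaller
# output files on disk.
ds.write_dataset(
    table,
    "/tmp/out",
    format="parquet",
    partitioning=["part"],
    max_open_files=512,         # existing file-descriptor knob
    max_rows_per_file=100_000,  # kept low purely to bound memory
)
```

Lowering `max_rows_per_file` bounds how many rows any single open file can buffer, but it also caps final file sizes; a separate `max_open_rows` would decouple the memory bound from the on-disk file size.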