[ https://issues.apache.org/jira/browse/ARROW-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400304#comment-17400304 ]
Weston Pace edited comment on ARROW-13644 at 8/17/21, 10:11 AM:
----------------------------------------------------------------

> You would acquire the semaphore when opening a file and release it when
> closing the file

When writing a dataset we only close the files at the very end, once all data has been processed. There is no way to know ahead of time when we are finished working with a file: the very last batch of a 50GB dataset might write to the same file that the first batch did. The current implementation handles this by leaving all files open for the entire scan, which leads to running out of file handles. So as a compromise we can set a limit on how many files we keep open. When we reach the limit we have to close one of the currently open files, which might mean we create more files than strictly necessary. LRU is one way to decide which file to close.

This compromise won't always work well. Maybe it will help to consider a case where LRU performs poorly. Suppose a dataset looks something like...

{code}
part_col,val
1,0
2,1
3,2
4,3
5,4
1,5
2,6
3,7
4,8
5,9
1,10
2,11
3,12
4,13
5,14
{code}

If we partition on `part_col` with a batch size of 5 we will get three batches. If we don't put any limit on how many files we can have open, we end up with 5 files:

{code}
part_col=1/part-0.arrow
part_col=2/part-1.arrow
part_col=3/part-2.arrow
part_col=4/part-3.arrow
part_col=5/part-4.arrow
{code}

We open all 5 files on the first batch and keep them open for the entire write. If we need to limit how many files we have open (let's say 3) then we need some eviction strategy. With LRU we'd end up with 15 files...

{code}
part_col=1/part-0.arrow
part_col=1/part-5.arrow
part_col=1/part-10.arrow
part_col=2/part-1.arrow
part_col=2/part-6.arrow
...
{code}

Another way to handle it would be to sort the complete data by the partition column(s), but that would introduce a pipeline breaker.
Or we could close the file that has the most rows in it, but that would require a priority queue, which isn't really simpler.
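The 15-file arithmetic above can be checked with a small simulation. This is a hedged sketch, not the actual dataset writer: the function name `CountFilesWithLru` and its shape are illustrative assumptions; it simply counts how many files an LRU-evicting writer would create for a sequence of partition accesses, given a cap on simultaneously open files.

```cpp
#include <list>
#include <unordered_set>
#include <vector>

// Hypothetical sketch (not Arrow's writer): count files created when a
// writer with a "max open files" cap uses LRU eviction to pick which
// open file to close.
int CountFilesWithLru(int max_open, const std::vector<int>& partitions) {
  std::list<int> order;          // front = most recently used partition
  std::unordered_set<int> open;  // partitions with a currently open file
  int files_created = 0;
  for (int part : partitions) {
    if (open.count(part)) {
      order.remove(part);        // hit: refresh recency, reuse open file
      order.push_front(part);
      continue;
    }
    if (static_cast<int>(open.size()) >= max_open) {
      int lru = order.back();    // at the cap: close the LRU file
      order.pop_back();
      open.erase(lru);
    }
    open.insert(part);           // miss: open a fresh part-N file
    order.push_front(part);
    ++files_created;
  }
  return files_created;
}
```

For the example dataset the access sequence is three batches of partitions 1 through 5: with `max_open = 3` this yields 15 files, and with `max_open = 5` (no effective limit) it yields 5, matching the comment.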
> [C++] Create LruCache that works with futures
> ---------------------------------------------
>
>                 Key: ARROW-13644
>                 URL: https://issues.apache.org/jira/browse/ARROW-13644
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>
> The dataset writer needs an LRU cache to keep track of open files so that it
> can respect a "max open files" property (see ARROW-12321). A synchronous
> LruCache implementation already exists but on eviction from the cache we need
> to wait until all pending writes have completed before we evict the item and
> open a new file. This ticket is to create an AsyncLruCache which will allow
> the creation of items and the eviction of items to be asynchronous.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
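The eviction-waits-on-pending-writes idea in the ticket can be sketched roughly as follows. This is a hedged illustration, not Arrow's actual AsyncLruCache (Arrow uses its own Future type, not std::future, and the real design makes eviction itself asynchronous rather than blocking): each entry carries a future representing its in-flight writes, and the cache waits on the LRU entry's future before reusing its slot. All names (`AsyncLruCache`, `Insert`, `Get`) are illustrative assumptions.

```cpp
#include <cstddef>
#include <future>
#include <list>
#include <unordered_map>

// Hypothetical sketch: an LRU cache whose eviction waits for the evicted
// entry's pending writes to finish before the entry is removed.
template <typename Key, typename Value>
class AsyncLruCache {
 public:
  explicit AsyncLruCache(std::size_t capacity) : capacity_(capacity) {}

  // Returns nullptr on miss; refreshes recency on hit.
  Value* Get(const Key& key) {
    auto it = map_.find(key);
    if (it == map_.end()) return nullptr;
    // splice moves the key to the front without invalidating iterators.
    order_.splice(order_.begin(), order_, it->second.pos);
    return &it->second.value;
  }

  // Inserts a new entry (assumes `key` is not already present).  If the
  // cache is full, waits for the LRU entry's pending writes to complete,
  // then evicts it -- the behavior the ticket wants to make asynchronous.
  void Insert(const Key& key, Value value, std::shared_future<void> pending) {
    if (map_.size() >= capacity_) {
      const Key lru = order_.back();
      map_.at(lru).pending.wait();  // block until its writes have flushed
      map_.erase(lru);
      order_.pop_back();
    }
    order_.push_front(key);
    map_.emplace(key,
                 Entry{std::move(value), std::move(pending), order_.begin()});
  }

  std::size_t size() const { return map_.size(); }

 private:
  struct Entry {
    Value value;
    std::shared_future<void> pending;  // in-flight writes for this entry
    typename std::list<Key>::iterator pos;
  };
  std::size_t capacity_;
  std::list<Key> order_;  // front = most recently used
  std::unordered_map<Key, Entry> map_;
};
```

In a real async design the `wait()` call would instead chain a continuation onto the pending future, so eviction (and the open of the replacement file) completes asynchronously instead of blocking the writer thread.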