[
https://issues.apache.org/jira/browse/ARROW-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400304#comment-17400304
]
Weston Pace commented on ARROW-13644:
-------------------------------------
> You would acquire the semaphore when opening a file and release it when
> closing the file
When writing a dataset we only close the files at the very end when all data
has been processed. There is no way to know ahead of time when we are finished
working with a file. The very last batch of a 50GB dataset might write to the
same file that the first batch did.
The current implementation handles this by leaving all files open for the
entire scan. That leads to running out of file handles. So as a compromise we
can set a limit on how many files we keep open. When we reach the limit we
have to close one of the currently open files. This might mean we create more
files than strictly necessary. LRU is one way to decide which file to close.
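As a rough sketch of that compromise (names like {{OpenFileLru}} and {{Touch}} are illustrative, not Arrow's actual API): track open writers in LRU order and close the least recently used one when a new partition needs a file and the limit is reached.

```cpp
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical sketch: keep at most `max_open` files alive, evicting the
// least recently used one when a new partition needs a file.
class OpenFileLru {
 public:
  explicit OpenFileLru(size_t max_open) : max_open_(max_open) {}

  // Records an access to `partition`'s file.
  // Returns true if a new file had to be opened (i.e. a cache miss).
  bool Touch(const std::string& partition) {
    auto it = index_.find(partition);
    if (it != index_.end()) {
      // Cache hit: move the entry to the most-recently-used position.
      lru_.splice(lru_.end(), lru_, it->second);
      return false;
    }
    if (lru_.size() >= max_open_) {
      // At the limit: close the least recently used file.
      index_.erase(lru_.front());
      lru_.pop_front();
    }
    lru_.push_back(partition);
    index_[partition] = std::prev(lru_.end());
    return true;  // a new file was opened
  }

 private:
  size_t max_open_;
  std::list<std::string> lru_;  // front = least recently used
  std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};
```

For example, with a limit of 2, touching partitions a, b, c, a in turn would open four files: touching c closes a, so touching a again must open a fresh file.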
This compromise won't always work well. Maybe it will help to consider a case
where LRU performs poorly. If a dataset looks something like...
{code:csv}
part_col, val
1,0
2,1
3,2
4,3
5,4
1,5
2,6
3,7
4,8
5,9
1,10
2,11
3,12
4,13
5,14
{code}
If we partition on `part_col` with a batch size of 5 we get three batches. If
we don't put any limit on how many files we can have open then we end up with
5 files:
{code}
part_col=1/part-0.arrow
part_col=2/part-1.arrow
part_col=3/part-2.arrow
part_col=4/part-3.arrow
part_col=5/part-4.arrow
{code}
We open 5 files on the first batch and keep them open for the entire write. If
we need to limit how many files we have open (say, 3) then we need to figure
something out. With LRU we'd end up with 15 files...
{code}
part_col=1/part-0.arrow
part_col=1/part-5.arrow
part_col=1/part-10.arrow
part_col=2/part-1.arrow
part_col=2/part-6.arrow
...
{code}
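This worst case is easy to check with a small simulation (the function name is made up for illustration): each batch touches every partition once, so with 5 partitions cycling through a 3-slot LRU cache, every access is a miss and every miss starts a new file.

```cpp
#include <algorithm>
#include <deque>

// Counts how many files an LRU policy creates when each of `num_batches`
// batches touches all `num_partitions` partitions in the same order.
int CountFilesCreated(int num_partitions, int num_batches, int max_open) {
  std::deque<int> open;  // open partitions, front = least recently used
  int files_created = 0;
  for (int batch = 0; batch < num_batches; ++batch) {
    for (int part = 0; part < num_partitions; ++part) {
      auto it = std::find(open.begin(), open.end(), part);
      if (it != open.end()) {
        open.erase(it);  // cache hit: will re-insert at MRU position
      } else {
        ++files_created;  // cache miss: a new file must be opened
        if (static_cast<int>(open.size()) >= max_open) {
          open.pop_front();  // close the least recently used file
        }
      }
      open.push_back(part);
    }
  }
  return files_created;
}
```

With no effective limit, {{CountFilesCreated(5, 3, 5)}} returns 5; with a limit of 3 it returns 15, matching the example above. A cyclic access pattern is the classic adversarial case for LRU.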
Another way to handle it would be to sort the complete data by the partition
column(s), but that would introduce a pipeline breaker. Or we could close the
file that has the most rows in it, but that would require a priority queue,
which isn't really simpler.
> [C++] Create LruCache that works with futures
> ---------------------------------------------
>
> Key: ARROW-13644
> URL: https://issues.apache.org/jira/browse/ARROW-13644
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: C++
> Reporter: Weston Pace
> Assignee: Weston Pace
> Priority: Major
>
> The dataset writer needs an LRU cache to keep track of open files so that it
> can respect a "max open files" property (see ARROW-12321). A synchronous
> LruCache implementation already exists but on eviction from the cache we need
> to wait until all pending writes have completed before we evict the item and
> open a new file. This ticket is to create an AsyncLruCache which will allow
> the creation of items and the eviction of items to be asynchronous.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)