[
https://issues.apache.org/jira/browse/ARROW-14635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kouhei Sutou updated ARROW-14635:
---------------------------------
Fix Version/s: 11.0.0
(was: 10.0.0)
> [C++][Dataset] Devise a mechanism to limit the total "system ram" (process +
> cache) used by dataset writes
> ----------------------------------------------------------------------------------------------------------
>
> Key: ARROW-14635
> URL: https://issues.apache.org/jira/browse/ARROW-14635
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Assignee: Ziheng Wang
> Priority: Major
> Labels: dataset, pull-request-available
> Fix For: 11.0.0
>
> Time Spent: 15h
> Remaining Estimate: 0h
>
> The dataset writer now correctly applies backpressure. However, that
> backpressure is only applied when the write calls slow down. This only
> happens when the OS disk cache fills up.
> However, filling up the OS disk cache is undesirable. It will cause all
> running processes to get swapped (assuming the system has any swap
> configured) and will make the system unusable for anything else.
> This typically has no actual benefit to the dataset write. The marginal
> performance boost provided by the extra RAM is often not worth the cost.
> One way to avoid filling the cache would be to use direct I/O (although that
> comes with a plethora of warnings). Another way might be to flag the output
> as DONTNEED (e.g. POSIX_FADV_DONTNEED via posix_fadvise), but I don't know
> for sure if this works (the OS might still cache it so that it can satisfy
> the write call quickly). Another way might be to somehow track how much disk
> cache is being used for writes, but that would get complex. I'm sure there
> are other ways that I'm just not aware of yet.
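
To make the "advise the kernel" idea above concrete, here is a minimal sketch, assuming Linux/POSIX. It is not Arrow's implementation: the helper name WriteWithCacheLimit and its parameters are hypothetical, and the only real calls it uses are write, fdatasync and posix_fadvise with POSIX_FADV_DONTNEED. After every flush_every bytes it syncs the file and then advises the kernel that the just-written range is no longer needed. The sync comes first because, on Linux, dirty pages that have not yet been written back may be left untouched by the advice.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an Arrow API): writes `total_bytes` of filler data
// to `fd` in chunks of `chunk_bytes`. After every `flush_every` bytes it
// flushes the file and hints that the flushed range can leave the page cache.
// Returns 0 on success and -1 on the first failing system call.
// Simplification: EINTR is treated as a failure rather than retried.
int WriteWithCacheLimit(int fd, std::size_t total_bytes,
                        std::size_t chunk_bytes, std::size_t flush_every) {
  assert(chunk_bytes > 0 && flush_every > 0);
  std::vector<char> chunk(chunk_bytes, 'x');
  std::size_t written = 0;       // bytes written so far
  std::size_t window_start = 0;  // start of the range not yet flushed

  while (written < total_bytes) {
    const std::size_t want = std::min(chunk_bytes, total_bytes - written);
    const ssize_t n = write(fd, chunk.data(), want);
    if (n < 0) return -1;
    written += static_cast<std::size_t>(n);  // short writes just loop again

    const std::size_t window = written - window_start;
    if (window >= flush_every || written == total_bytes) {
      // Push dirty pages to disk so they become clean and droppable.
      if (fdatasync(fd) != 0) return -1;
      // Advisory only: the kernel may ignore it. posix_fadvise returns an
      // error number directly instead of setting errno.
      const int rc = posix_fadvise(fd, static_cast<off_t>(window_start),
                                   static_cast<off_t>(window),
                                   POSIX_FADV_DONTNEED);
      if (rc != 0) return -1;
      window_start = written;
    }
  }
  return 0;
}
```

The trade-off in this sketch is that each fdatasync call blocks the writer, so the amount of dirty cache is bounded roughly by flush_every at the cost of throughput. Direct I/O (O_DIRECT) avoids the cache entirely but requires suitably aligned buffers, offsets and sizes, which is part of why it comes with so many warnings.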
--
This message was sent by Atlassian Jira
(v8.20.10#820010)