[jira] [Updated] (ARROW-14635) [C++][Dataset] Devise a mechanism to limit the total "system ram" (process + cache) used by dataset writes

Kouhei Sutou (Jira) Wed, 19 Oct 2022 21:55:09 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-14635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kouhei Sutou updated ARROW-14635:
---------------------------------
    Fix Version/s: 11.0.0
                       (was: 10.0.0)

> [C++][Dataset] Devise a mechanism to limit the total "system ram" (process + 
> cache) used by dataset writes
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14635
>                 URL: https://issues.apache.org/jira/browse/ARROW-14635
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Ziheng Wang
>            Priority: Major
>              Labels: dataset, pull-request-available
>             Fix For: 11.0.0
>
>          Time Spent: 15h
>  Remaining Estimate: 0h
>
> The dataset writer now correctly applies backpressure.  However, that 
> backpressure is only applied when the write calls slow down.  This only 
> happens when the OS disk cache fills up.
> However, filling up the OS disk cache is undesirable.  It will cause all 
> running processes to get swapped (assuming the system has any swap 
> configured) and will make the system unusable for anything else.
> This typically has no actual benefit to the dataset write.  The marginal 
> performance boost provided by the extra RAM is often not worth the cost.
> One way to avoid filling the cache would be to use direct I/O (although 
> that comes with a plethora of warnings).  Another way might be to flag the 
> output as DONTNEED (as in POSIX_FADV_DONTNEED), but I don't know for sure 
> if this works (the OS might still cache it so that it can satisfy the 
> write call quickly).  Another way might be to somehow track how much disk 
> cache is being used for writes, but that would get complex.  I'm sure 
> there are other ways I'm just not aware of yet.
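
For illustration only, the cache-advice idea from the description could be sketched as below. This is not Arrow's dataset writer and not a proposed patch. It is a minimal Python sketch that assumes a POSIX system where os.posix_fadvise is available (e.g. Linux). The names write_with_cache_drop, _sync_and_drop and flush_every are hypothetical. As the description notes, the kernel treats this as a hint, so it may or may not keep the page cache small in practice.

```python
import os


def _sync_and_drop(fd):
    """Flush dirty pages, then hint that the cached pages are not needed."""
    # Dirty pages cannot be evicted, so force them to disk first.
    os.fsync(fd)
    # Assumption: posix_fadvise exists on this platform (it does on Linux).
    # offset=0, len=0 means "the whole file". This is advice, not a guarantee.
    if hasattr(os, "posix_fadvise"):
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)


def write_with_cache_drop(path, chunks, flush_every=8 * 1024 * 1024):
    """Write byte chunks to path, syncing and advising every flush_every bytes.

    Returns the total number of bytes written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    written = 0       # total bytes written so far
    last_flush = 0    # value of `written` at the last sync-and-drop
    try:
        for chunk in chunks:
            view = memoryview(chunk)
            while view:
                # os.write may write fewer bytes than requested.
                n = os.write(fd, view)
                view = view[n:]
                written += n
            if written - last_flush >= flush_every:
                _sync_and_drop(fd)
                last_flush = written
        # Final flush so nothing written here lingers in the cache.
        _sync_and_drop(fd)
    finally:
        os.close(fd)
    return written
```

The fsync before the advice is the costly part: it trades write throughput for a bounded amount of dirty and cached data, which is the trade-off the description argues is usually worthwhile.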



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
