[ 
https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422441#comment-17422441
 ] 

Weston Pace commented on ARROW-10439:
-------------------------------------

So the challenge with a bytes limit is that we need to know how many bytes are 
going to be written before the potentially blocking write call.  The way the 
file writers are currently structured, that is not easy.  Options available:

 * Modify the file writers to be truly asynchronous and return "{ bytes_queued: 
int64_t, write_future: Future<> }" (or they could return a Future<> and have a 
method to query how many total bytes have been queued to be written to the 
file).
 * Use the in-memory size of the data (the downside is that this can be quite 
different from the written size when compression is used, which is often the 
case).
 * Enforce a best-effort limit which checks the current file size when 
determining whether a new file should be opened.  The problem in this case is 
that we will queue some number of batches beyond the limit, so it becomes a 
soft limit that we will typically overshoot by some amount.
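To make the trade-off in the last option concrete, here is a minimal, hypothetical sketch (the class name and byte accounting are illustrative, not Arrow's actual writer API) of a best-effort limit that checks the current file size before each batch:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the best-effort option: the writer checks the size
// of the current file *before* queueing a batch and rolls over to a new file
// once the limit has been reached.  Bytes still in flight are not accounted
// for, so a file can overshoot max_bytes by up to the size of the batches
// queued since the last check -- i.e. the limit is soft.
class SoftLimitWriter {
 public:
  explicit SoftLimitWriter(int64_t max_bytes) : max_bytes_(max_bytes) {}

  // Routes a batch to a file and returns that file's index.
  int WriteBatch(int64_t batch_bytes) {
    if (current_bytes_ >= max_bytes_) {
      // Soft limit reached: close the current file and start a new one.
      closed_file_sizes_.push_back(current_bytes_);
      current_bytes_ = 0;
    }
    current_bytes_ += batch_bytes;
    return static_cast<int>(closed_file_sizes_.size());
  }

  const std::vector<int64_t>& closed_file_sizes() const {
    return closed_file_sizes_;
  }

 private:
  int64_t max_bytes_;
  int64_t current_bytes_ = 0;
  std::vector<int64_t> closed_file_sizes_;
};
```

With max_bytes = 100 and 60-byte batches, the first two batches both land in file 0 (the check passes at 60 bytes), so file 0 closes at 120 bytes -- past the limit by nearly a full batch, which is exactly the overshoot described above.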

Does anyone have any other ideas or suggestions, or a preference among the 
available options?  [~lidavidm] what was the approach used for flight?

> [C++][Dataset] Add max file size as a dataset writing option
> ------------------------------------------------------------
>
>                 Key: ARROW-10439
>                 URL: https://issues.apache.org/jira/browse/ARROW-10439
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 2.0.0
>            Reporter: Ben Kietzman
>            Assignee: Weston Pace
>            Priority: Minor
>              Labels: beginner, dataset, query-engine
>             Fix For: 6.0.0
>
>
> This should be specified as a row limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
