[
https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422446#comment-17422446
]
David Li commented on ARROW-10439:
----------------------------------
The approach used for Flight would really only work for IPC, unfortunately. It
optimistically assumes batches are below the limit and hooks into the
low-level IPC writer implementation so that it is handed the
already-serialized batches; that way, it doesn't waste work computing the
actual serialized size separately (which is expensive). If a batch turns out
to be over the size limit, the writer rejects it, and the caller is expected
to try again with a smaller batch. I suppose you could generalize this to CSV
(by serializing rows to a buffer before writing them out), though that would
be an expensive/invasive refactor (and I have no clue about Parquet).
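To make the retry contract concrete, here is a rough caller-side sketch in
C++ against the public Arrow API. It is hypothetical, not the actual Flight
code: in particular, it assumes the writer rejects an oversized batch with
StatusCode::Invalid before writing any bytes.

    #include <memory>

    #include <arrow/ipc/writer.h>
    #include <arrow/record_batch.h>
    #include <arrow/status.h>

    // Optimistically write the batch; if the (hypothetical) size-checking
    // writer rejects it as oversized, split it in half and retry each piece.
    arrow::Status WriteWithRetry(arrow::ipc::RecordBatchWriter* writer,
                                 std::shared_ptr<arrow::RecordBatch> batch) {
      arrow::Status st = writer->WriteRecordBatch(*batch);
      if (st.ok() || !st.IsInvalid() || batch->num_rows() <= 1) {
        return st;
      }
      // Recurse on each half; slicing is zero-copy, so no data is duplicated.
      const int64_t half = batch->num_rows() / 2;
      ARROW_RETURN_NOT_OK(WriteWithRetry(writer, batch->Slice(0, half)));
      return WriteWithRetry(writer, batch->Slice(half));
    }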
I'll note that even the "in-memory size" can be difficult to compute if you
have slices. The GetRecordBatchSize function actually serializes the batch
under the hood and counts the bytes written.
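For illustration, a minimal sketch of that distinction (the 10-row slice
length is made up):

    #include <memory>

    #include <arrow/ipc/writer.h>
    #include <arrow/record_batch.h>
    #include <arrow/status.h>

    arrow::Status SizeOfSlice(const std::shared_ptr<arrow::RecordBatch>& batch) {
      // A 10-row slice still references the parent batch's buffers, so its
      // "in-memory size" is ambiguous...
      std::shared_ptr<arrow::RecordBatch> slice = batch->Slice(0, 10);
      int64_t size = 0;
      // ...whereas GetRecordBatchSize serializes the slice and counts the
      // bytes that would actually be written for those 10 rows.
      ARROW_RETURN_NOT_OK(arrow::ipc::GetRecordBatchSize(*slice, &size));
      return arrow::Status::OK();
    }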
> [C++][Dataset] Add max file size as a dataset writing option
> ------------------------------------------------------------
>
> Key: ARROW-10439
> URL: https://issues.apache.org/jira/browse/ARROW-10439
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 2.0.0
> Reporter: Ben Kietzman
> Assignee: Weston Pace
> Priority: Minor
> Labels: beginner, dataset, query-engine
> Fix For: 6.0.0
>
>
> This should be specified as a row limit.