[
https://issues.apache.org/jira/browse/ARROW-10439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422446#comment-17422446
]
David Li commented on ARROW-10439:
----------------------------------
The approach used for Flight would really only work for IPC, unfortunately. It
optimistically assumes batches are below the limit and hooks into the
low-level IPC writer implementation so that it is handed the
already-serialized batches; that way, it doesn't waste work computing the
actual serialized size separately (which is expensive). If a batch turns out
to be over the size limit, the writer rejects it, and the caller is expected
to try again with a smaller batch. I suppose you could generalize this to CSV
(by serializing rows to a buffer before writing them out), though that would
be an expensive/invasive refactor (and I have no clue about Parquet).
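To make the retry contract concrete, here is a rough caller-side sketch in
C++ against the public Arrow API. It is hypothetical, not the actual Flight
code: in particular, it assumes the writer rejects an oversized batch with
StatusCode::Invalid before writing any bytes.

    #include <memory>

    #include <arrow/ipc/writer.h>
    #include <arrow/record_batch.h>
    #include <arrow/status.h>

    // Optimistically write the batch; if the (hypothetical) size-checking
    // writer rejects it as oversized, split it in half and retry each piece.
    arrow::Status WriteWithRetry(arrow::ipc::RecordBatchWriter* writer,
                                 std::shared_ptr<arrow::RecordBatch> batch) {
      arrow::Status st = writer->WriteRecordBatch(*batch);
      if (st.ok() || !st.IsInvalid() || batch->num_rows() <= 1) {
        return st;
      }
      // Recurse on each half; slicing is zero-copy, so no data is duplicated.
      const int64_t half = batch->num_rows() / 2;
      ARROW_RETURN_NOT_OK(WriteWithRetry(writer, batch->Slice(0, half)));
      return WriteWithRetry(writer, batch->Slice(half));
    }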
I'll note that even the "in-memory size" can be difficult to compute if you
have slices. The GetRecordBatchSize function actually serializes the batch
under the hood and counts the bytes written.
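For illustration, a minimal sketch of that distinction (the 10-row slice
length is made up):

    #include <memory>

    #include <arrow/ipc/writer.h>
    #include <arrow/record_batch.h>
    #include <arrow/status.h>

    arrow::Status SizeOfSlice(const std::shared_ptr<arrow::RecordBatch>& batch) {
      // A 10-row slice still references the parent batch's buffers, so its
      // "in-memory size" is ambiguous...
      std::shared_ptr<arrow::RecordBatch> slice = batch->Slice(0, 10);
      int64_t size = 0;
      // ...whereas GetRecordBatchSize serializes the slice and counts the
      // bytes that would actually be written for those 10 rows.
      ARROW_RETURN_NOT_OK(arrow::ipc::GetRecordBatchSize(*slice, &size));
      return arrow::Status::OK();
    }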
> [C++][Dataset] Add max file size as a dataset writing option
> ------------------------------------------------------------
>
> Key: ARROW-10439
> URL: https://issues.apache.org/jira/browse/ARROW-10439
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 2.0.0
> Reporter: Ben Kietzman
> Assignee: Weston Pace
> Priority: Minor
> Labels: beginner, dataset, query-engine
> Fix For: 6.0.0
>
>
> This should be specified as a row limit.