[
https://issues.apache.org/jira/browse/ARROW-14426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432764#comment-17432764
]
Weston Pace commented on ARROW-14426:
-------------------------------------
So things can get a little tricky in certain situations. For example, if
min_row_group_size is 1M and max_rows_queued is 64M and you just so happen
to have 900k rows per file while creating 100 files, then you would end up in
deadlock: no file ever reaches the minimum, so nothing gets written, yet the
90M queued rows exceed the max_rows_queued limit.
Even if you were only creating 50 files it would still be non-ideal, because
none of the writes would happen until the entire dataset (45M rows) had
accumulated in memory.
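To make the arithmetic concrete, here is a tiny standalone sketch of the
deadlock condition (the names below are illustrative placeholders, not the
actual writer options):

#include <cstdint>
#include <iostream>

int main() {
  // Illustrative numbers from the scenario above.
  const int64_t min_row_group_size = 1'000'000;   // 1M-row minimum
  const int64_t max_rows_queued = 64'000'000;     // 64M-row backpressure limit
  const int64_t rows_per_file = 900'000;          // 900k rows per file
  const int64_t num_files = 100;

  // No single file ever reaches the minimum, so nothing is flushable...
  const bool nothing_flushable = rows_per_file < min_row_group_size;
  // ...while the total queued rows exceed the backpressure limit.
  const bool queue_full = num_files * rows_per_file >= max_rows_queued;

  std::cout << "deadlock: " << (nothing_flushable && queue_full) << "\n";
  return 0;
}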
To work around this I think I will create a soft limit (defaulting to 8M rows
because I like nice round powers of two) on batchable rows. Once there are
more than 8M batchable rows I will start evicting batches, even if they are
smaller than min_row_group_size.
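As a rough sketch of what I mean (this is not the actual implementation; the
names Writer/Append/Flush are made up for illustration):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

struct Writer {
  int64_t min_row_group_size = 1'000'000;  // preferred minimum per row group
  int64_t max_batchable_rows = 8'000'000;  // soft limit on queued rows
  int64_t total_batchable = 0;
  std::map<std::string, int64_t> queued;   // rows accumulated per file

  void Flush(const std::string& file) {
    std::cout << "writing " << queued[file] << " rows to " << file << "\n";
    total_batchable -= queued[file];
    queued.erase(file);
  }

  void Append(const std::string& file, int64_t rows) {
    queued[file] += rows;
    total_batchable += rows;
    if (queued[file] >= min_row_group_size) {
      Flush(file);  // normal path: the minimum was reached
    }
    // Soft-limit eviction: flush the largest accumulation even though it
    // is below min_row_group_size, so backpressure can never deadlock us.
    while (total_batchable > max_batchable_rows) {
      auto largest = std::max_element(
          queued.begin(), queued.end(),
          [](const auto& a, const auto& b) { return a.second < b.second; });
      const std::string victim = largest->first;  // copy before erasing
      Flush(victim);
    }
  }
};

int main() {
  Writer w;
  // 100 files receiving 900k rows each, as in the scenario above.
  for (int i = 0; i < 100; ++i) {
    w.Append("part-" + std::to_string(i), 900'000);
  }
  return 0;
}

With the numbers from the example above, nothing would be written until nine
files had accumulated; at that point the soft limit starts forcing out
900k-row groups instead of deadlocking.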
I'm fairly certain this will go unnoticed in 99% of scenarios until some point
in the future when I've forgotten all of this and I'm debugging why a small
batch got created.
> [C++] Add a minimum_row_group_size to dataset writing
> -----------------------------------------------------
>
> Key: ARROW-14426
> URL: https://issues.apache.org/jira/browse/ARROW-14426
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Jonathan Keane
> Assignee: Weston Pace
> Priority: Major
>
> Right now we write whatever chunks we get, but if those chunks are
> exceptionally small, we should bundle them up so that we write row groups
> of at least a configurable minimum size