[ https://issues.apache.org/jira/browse/ARROW-14426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432764#comment-17432764 ]

Weston Pace commented on ARROW-14426:
-------------------------------------

So things can get a little tricky in certain situations.  For example, if 
min_row_group_size is 1M rows and max_rows_queued is 64M rows, and you just so 
happen to have 900k rows per file while creating 100 files, then you would end 
up in deadlock: no single file ever accumulates enough rows to write a row 
group, so nothing gets written, and the queue hits the max_rows_queued limit.

Even if you were only creating 50 files it would still be non-ideal because 
none of the writes would happen until the entire dataset had accumulated in 
memory.
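
To make the arithmetic concrete (a sketch of the two scenarios above, using 
the numbers from my example):

    100 files x 900k rows/file = 90M rows queued, 0 rows writable
        (no file ever reaches the 1M min_row_group_size, and
         90M > 64M max_rows_queued  =>  deadlock)

    50 files x 900k rows/file = 45M rows queued, 0 rows writable
        (45M < 64M, so no deadlock, but the entire dataset is
         buffered in memory before any write happens)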

To work around this I think I will add a soft limit on batchable rows 
(defaulting to 8M because I like nice round powers of two).  Once there are 
more than 8M batchable rows I will start evicting batches, even though they 
are smaller than min_row_group_size.
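
Roughly, the eviction logic would look something like this (a minimal 
standalone sketch; none of these names are the actual dataset writer API, 
the per-file bookkeeping is simplified, and kMinRowGroupSize / 
kSoftBatchableCap are illustrative stand-ins):

    #include <cstdint>
    #include <iostream>
    #include <vector>

    constexpr int64_t kMinRowGroupSize  = 1'000'000;  // 1M-row minimum (example)
    constexpr int64_t kSoftBatchableCap = 8'388'608;  // proposed 8M (2^23) soft limit

    int main() {
      // 100 open files, each receiving 900k rows in 100k-row deliveries,
      // mirroring the deadlock scenario above.
      std::vector<int64_t> queued(100, 0);
      int64_t batchable = 0;  // total rows sitting in sub-minimum batches

      for (int round = 0; round < 9; ++round) {
        for (int64_t& rows : queued) {
          rows += 100'000;
          batchable += 100'000;
          if (rows >= kMinRowGroupSize) {
            // Normal path: a full row group has accumulated; write it out.
            batchable -= rows;
            rows = 0;
          } else if (batchable > kSoftBatchableCap) {
            // Soft-limit path: evict a batch that is smaller than
            // min_row_group_size so queued rows cannot grow unbounded.
            std::cout << "evicting small batch of " << rows << " rows\n";
            batchable -= rows;
            rows = 0;
          }
        }
      }
      return 0;
    }

With the soft limit in place the 100-file case above no longer deadlocks: 
once more than 8M rows are queued, sub-minimum batches start getting written, 
which is also exactly why a surprisingly small row group can show up later.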

I'm fairly certain this will go unnoticed in 99% of scenarios until some point 
in the future when I've forgotten all of this and I'm debugging why a small 
batch got created.

> [C++] Add a minimum_row_group_size to dataset writing
> -----------------------------------------------------
>
>                 Key: ARROW-14426
>                 URL: https://issues.apache.org/jira/browse/ARROW-14426
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Jonathan Keane
>            Assignee: Weston Pace
>            Priority: Major
>
> Right now we write whatever chunks we get, but if those chunks are 
> exceptionally small, we should bundle them up and write out a configurable 
> minimum row group size.



