[ 
https://issues.apache.org/jira/browse/ARROW-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271224#comment-17271224
 ] 

Joris Van den Bossche commented on ARROW-11120:
-----------------------------------------------

A few notes to clarify why you got a table with so many small record batches:

- By default, the dataset scanner reads each Parquet row group into one 
RecordBatch
- The {{batch_size}} keyword is a _maximum_ number of rows, so it will split 
row groups into multiple record batches when above that max, but will (AFAIK) 
never combine different smaller row groups to obtain record batches closer to 
the {{batch_size}}
- Because you are applying a filter, which is done per row group/record batch, 
each row group yields only a few rows, so you end up with many small record 
batches (which are currently not rechunked, and varying {{batch_size}} has no 
effect in this case)
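
The split-but-never-combine behaviour described above can be illustrated with a toy model in plain Python. This is only a sketch of the semantics as I understand them, not the scanner's actual implementation: the function name {{split_row_groups}} and the sizes used are made up for illustration.

```python
def split_row_groups(row_group_sizes, batch_size):
    """Toy model: return the record batch sizes produced from the given row groups."""
    batches = []
    for n_rows in row_group_sizes:
        # A row group larger than batch_size is split into several batches ...
        while n_rows > batch_size:
            batches.append(batch_size)
            n_rows -= batch_size
        # ... but a smaller one is emitted as is, never merged with its neighbours.
        if n_rows > 0:
            batches.append(n_rows)
    return batches


# Unfiltered row groups of 65,000 rows are split at the maximum.
large = split_row_groups([65_000, 65_000], batch_size=20_000)
print(large)  # [20000, 20000, 20000, 5000, 20000, 20000, 20000, 5000]

# Heavily filtered row groups stay small, whatever batch_size is.
small = split_row_groups([120, 95, 140], batch_size=20_000)
print(small)  # [120, 95, 140]
```

In the second call the result is the same for any {{batch_size}} above 140, which is why varying it has no effect on a heavily filtered scan.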

Additionally, I think that our files in the ursa S3 bucket have been written 
with row groups of 65k rows, which is already relatively small, especially when 
filtering them heavily.

Now, the resulting table with chunks averaging about 120 rows is certainly not 
ideal (even if the performance issue in the conversion to R is fixed). A method 
to "re-chunk" a Table or ChunkedArray does seem generally useful, so I opened 
ARROW-11370 for this. We might also want to consider whether we can integrate 
something directly into the dataset API.






> [Python][R] Prove out plumbing to pass data between Python and R using rpy2
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-11120
>                 URL: https://issues.apache.org/jira/browse/ARROW-11120
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python, R
>            Reporter: Wes McKinney
>            Priority: Major
>
> Per discussion on the mailing list, we should see what is required (if 
> anything) to be able to pass data structures using the C interface between 
> Python and R from the perspective of the Python user using rpy2. rpy2 is sort 
> of the Python version of reticulate. Unit tests will then validate that it's 
> working



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
