[
https://issues.apache.org/jira/browse/ARROW-16015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512021#comment-17512021
]
Weston Pace commented on ARROW-16015:
-------------------------------------
Yes, clarification would be good. We could even rename the parameter to
{{max_batch_size}}, which is what we did in the table source node. A
{{min_batch_size}} would still be useful too, I think, though we would want to
document that it has performance implications.
You are correct that this would apply to IPC & ORC. This even applies to CSV
because the batch size is independent of the block size (which is specified in
the CSV read options).
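
For context, a minimal sketch (assuming a local {{data.csv}}; the file name is
illustrative) showing that the scanner's {{batch_size}} and the CSV reader's
{{block_size}} are configured independently:

{code:python}
import pyarrow.csv as csv
import pyarrow.dataset as ds

# block_size: how many bytes each CSV parsing step reads from the file,
# set via pyarrow.csv.ReadOptions on the CSV file format.
fmt = ds.CsvFileFormat(read_options=csv.ReadOptions(block_size=1 << 20))
dataset = ds.dataset("data.csv", format=fmt)

# batch_size: the maximum number of rows per emitted RecordBatch. It is
# independent of block_size; in the affected versions it is capped at 2**16.
for batch in dataset.to_batches(batch_size=10**6):
    print(batch.num_rows)
{code}
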
> [Python] Scanning batch size is limited to 65536 (2**16).
> ---------------------------------------------------------
>
> Key: ARROW-16015
> URL: https://issues.apache.org/jira/browse/ARROW-16015
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 7.0.0, 8.0.0
> Environment: macOS
> Reporter: A. Coady
> Priority: Major
>
> [Scanning
> batches|https://arrow.apache.org/docs/python/dataset.html#iterative-out-of-core-or-streaming-reads]
> is documented to default to a batch size of 1,000,000, but in practice the
> batch size both defaults to and is capped at 65536 (2**16).
> {code:python}
> In []: dataset.count_rows()
> Out[]: 538038292
> In []: next(dataset.to_batches()).num_rows  # default batch size
> Out[]: 65536
> In []: next(dataset.to_batches(batch_size=10**6)).num_rows  # larger value is ignored
> Out[]: 65536
> In []: next(dataset.to_batches(batch_size=10**4)).num_rows  # smaller value is honored
> Out[]: 10000
> {code}