[ https://issues.apache.org/jira/browse/ARROW-16015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511785#comment-17511785 ]

Joris Van den Bossche commented on ARROW-16015:
-----------------------------------------------

We could clarify in the documentation of {{batch_size}} that the actual 
upper limit may also depend on the layout of your files (such as the row 
group size in Parquet files). 

I suppose the same applies to the record batch size in IPC / Feather 
files (or the stripe size in ORC files)?

> [python] Scanning batch size is limited to 65536 (2**16).
> ---------------------------------------------------------
>
>                 Key: ARROW-16015
>                 URL: https://issues.apache.org/jira/browse/ARROW-16015
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 7.0.0, 8.0.0
>         Environment: macOS
>            Reporter: A. Coady
>            Priority: Major
>
> [Scanning 
> batches|https://arrow.apache.org/docs/python/dataset.html#iterative-out-of-core-or-streaming-reads]
>  is documented to default to a batch size of 1,000,000. But in practice the 
> batch size defaults to, and is capped at, 65536.
> {code:python}
> In []: dataset.count_rows()
> Out[]: 538038292
> In []: next(dataset.to_batches()).num_rows
> Out[]: 65536
> In []: next(dataset.to_batches(batch_size=10**6)).num_rows
> Out[]: 65536
> In []: next(dataset.to_batches(batch_size=10**4)).num_rows
> Out[]: 10000
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
