[ 
https://issues.apache.org/jira/browse/ARROW-16015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511588#comment-17511588
 ] 

Weston Pace commented on ARROW-16015:
-------------------------------------

The batch size can split larger row groups to fit small batch sizes, but it won't, at the moment, merge small row groups together to fit large batch sizes. Performance-wise this tends to be expensive (you'd need to allocate space big enough for both and then copy the data), but I can see how it might be useful in some scenarios.

> [python] Scanning batch size is limited to 65536 (2**16).
> ---------------------------------------------------------
>
>                 Key: ARROW-16015
>                 URL: https://issues.apache.org/jira/browse/ARROW-16015
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 7.0.0, 8.0.0
>         Environment: macOS
>            Reporter: A. Coady
>            Priority: Major
>
> [Scanning 
> batches|https://arrow.apache.org/docs/python/dataset.html#iterative-out-of-core-or-streaming-reads]
>  is documented to default to a batch size of 1,000,000. But the behavior is 
> that batch size defaults to - and is limited to - 65536.
> {code:python}
> In []: dataset.count_rows()
> Out[]: 538038292
> In []: next(dataset.to_batches()).num_rows
> Out[]: 65536
> In []: next(dataset.to_batches(batch_size=10**6)).num_rows
> Out[]: 65536
> In []: next(dataset.to_batches(batch_size=10**4)).num_rows
> Out[]: 10000
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
