A. Coady created ARROW-16015:
--------------------------------

             Summary: Scanning batch size is limited to 65536 (2**16).
                 Key: ARROW-16015
                 URL: https://issues.apache.org/jira/browse/ARROW-16015
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 7.0.0, 8.0.0
         Environment: macOS
            Reporter: A. Coady


[Scanning batches|https://arrow.apache.org/docs/python/dataset.html#iterative-out-of-core-or-streaming-reads] is documented to default to a batch size of 1,000,000. In practice, the batch size both defaults to and is capped at 65536 (2**16): requesting a larger batch_size has no effect, while a smaller batch_size is honored.
{code:python}
In []: dataset.count_rows()
Out[]: 538038292

In []: next(dataset.to_batches()).num_rows
Out[]: 65536

In []: next(dataset.to_batches(batch_size=10**6)).num_rows
Out[]: 65536

In []: next(dataset.to_batches(batch_size=10**4)).num_rows
Out[]: 10000
{code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
