wjones127 commented on issue #34758:
URL: https://github.com/apache/arrow/issues/34758#issuecomment-1489245292
I think the issue you are seeing is that `Dataset.to_batches()` doesn't
combine row groups, so if your row groups are smaller than the requested
batch size you will get smaller batches. You should think of `batch_size`
here as an upper bound, not an exact size.
Write a Parquet file with a row group size of 100:
```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
import pyarrow.dataset as ds
# 1,000 rows of random values
tab = pa.table({
    'x': pc.random(1_000)
})

# Written as 10 row groups of 100 rows each
path = "test_stream.parquet"
pq.write_table(tab, path, row_group_size=100)
```
Asking for batches of size 400 still gives batches of size 100, the size of
the row groups:
```python
dataset = ds.dataset(path)
# batch_size caps the batch size, but batches don't span row groups
it = dataset.to_batches(batch_size=400)
lst = [x.num_rows for x in it]
lst
```
```
[100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
```
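If you really do need 400-row batches from a file like this and the data fits
in memory, one possible workaround (just a sketch, not the only way) is to read
the dataset into a table, merge its chunks, and re-slice:
```python
# Sketch of a workaround: materialize the dataset, merge the 100-row chunks,
# then slice into batches of at most 400 rows.
table = dataset.to_table().combine_chunks()
[b.num_rows for b in table.to_batches(max_chunksize=400)]
# expected: [400, 400, 200]
```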
You should be able to inspect your file's metadata like this:
```python
metadata = pq.read_metadata(path)
[metadata.row_group(i) for i in range(metadata.num_row_groups)]
```
```
[<pyarrow._parquet.RowGroupMetaData object at 0x127f8a8e0>
num_columns: 1
num_rows: 100
total_byte_size: 981,
<pyarrow._parquet.RowGroupMetaData object at 0x136e72890>
num_columns: 1
num_rows: 100
total_byte_size: 981,
...
```
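Going the other direction: if you control how the file is written, a row group
size at least as large as the batch size you want should let `to_batches()`
hand back full-size batches. A quick sketch (the file name here is just for
illustration):
```python
# Rewrite the same data with 400-row row groups (illustrative file name)
path2 = "test_stream_rg400.parquet"
pq.write_table(tab, path2, row_group_size=400)

# Now batch_size=400 can actually be reached
[b.num_rows for b in ds.dataset(path2).to_batches(batch_size=400)]
# expected: [400, 400, 200]
```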