wjones127 commented on issue #34758:
URL: https://github.com/apache/arrow/issues/34758#issuecomment-1489245292

   I think the issue you are seeing is that `Dataset.to_batches()` doesn't 
combine row groups, so if your row groups are smaller than the requested 
batch size you will get smaller batches. You should think of `batch_size` 
here as an upper bound, not a guaranteed size.
   
   Write a Parquet file with a row group size of 100:
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.compute as pc
   import pyarrow.dataset as ds
   
   tab = pa.table({
       'x': pc.random(1_000)
   })
   path = "test_stream.parquet"
   pq.write_table(tab, path, row_group_size=100)
   ```
   
   Asking for batches of size 400 only yields batches of 100 rows, the size 
of the row groups:
   ```python
   dataset = ds.dataset(path)
   it = dataset.to_batches(batch_size=400)
   lst = [x.num_rows for x in it]
   lst
   ```
   
   ```
   [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   ```
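   
   If you actually need 400-row batches despite the 100-row row groups, one 
workaround is to materialize the table and re-slice it. A minimal sketch, 
assuming the `dataset` object from above and that the data fits in memory:
   ```python
   # Load all rows into memory, merge the per-row-group chunks into a
   # single chunk, then cut the combined table into 400-row batches
   # (only the last one can be smaller).
   table = dataset.to_table()
   batches = table.combine_chunks().to_batches(max_chunksize=400)
   [b.num_rows for b in batches]
   # -> [400, 400, 200]
   ```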
   
   You should be able to inspect your file's metadata like this:
   
   ```python
   metadata = pq.read_metadata(path)
   [metadata.row_group(i) for i in range(metadata.num_row_groups)]
   ```
   
   ```
   [<pyarrow._parquet.RowGroupMetaData object at 0x127f8a8e0>
      num_columns: 1
      num_rows: 100
      total_byte_size: 981,
    <pyarrow._parquet.RowGroupMetaData object at 0x136e72890>
      num_columns: 1
      num_rows: 100
      total_byte_size: 981,
   ...
   ```
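   
   If you control how the file is written, the other fix is simply to write 
larger row groups, so that `to_batches` can hand back batches as large as 
you request. A quick sketch reusing `tab` and `path` from above:
   ```python
   # Rewrite the same table with 400-row row groups.
   pq.write_table(tab, path, row_group_size=400)
   
   it = ds.dataset(path).to_batches(batch_size=400)
   [x.num_rows for x in it]
   # -> [400, 400, 200]
   ```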

