rmnskb commented on PR #48101:
URL: https://github.com/apache/arrow/pull/48101#issuecomment-3519405213
As mentioned above, PyArrow doesn't store any metadata about the Parquet
pages. To check that the argument works, I've created this sample Parquet file:
```python
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({"x": [1, 2, 3, 4, 5, 6, 7]})
pq.write_table(table, "/tmp/max_rows_per_page.parquet", max_rows_per_page=1)
```
Then I used `parquet-cli` from [Parquet
Java](https://github.com/apache/parquet-java/tree/master/parquet-cli) to
inspect the page information. Here's the output:
```bash
> parquet-cli pages /tmp/max_rows_per_page.parquet
Column: x
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  S _  7       8.00 B     56 B
  0-1    data  S R  1       9.00 B     9 B                 0       "1" / "1"
  0-2    data  S R  1       9.00 B     9 B                 0       "2" / "2"
  0-3    data  S R  1       9.00 B     9 B                 0       "3" / "3"
  0-4    data  S R  1       9.00 B     9 B                 0       "4" / "4"
  0-5    data  S R  1       9.00 B     9 B                 0       "5" / "5"
  0-6    data  S R  1       9.00 B     9 B                 0       "6" / "6"
  0-7    data  S R  1       9.00 B     9 B                 0       "7" / "7"
```
So the Python binding does indeed produce one page per row in the case above.