rmnskb commented on PR #48101:
URL: https://github.com/apache/arrow/pull/48101#issuecomment-3519405213

   As mentioned above, PyArrow doesn't store any metadata about Parquet 
pages. To check that the argument works, I created this sample Parquet file:
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   table = pa.table({"x": [1,2,3,4,5,6,7]})
   pq.write_table(table, "/tmp/max_rows_per_page.parquet", max_rows_per_page=1)
   ```
   Then I used `parquet-cli` from 
[Parquet Java](https://github.com/apache/parquet-java/tree/master/parquet-cli) 
to inspect the page info. Here's the output:
   ```bash
   > parquet-cli pages /tmp/max_rows_per_page.parquet
   
   Column: x
   
   --------------------------------------------------------------------------------
     page   type  enc  count   avg size   size       rows     nulls   min / max
     0-D    dict  S _  7       8.00 B     56 B
     0-1    data  S R  1       9.00 B     9 B                 0       "1" / "1"
     0-2    data  S R  1       9.00 B     9 B                 0       "2" / "2"
     0-3    data  S R  1       9.00 B     9 B                 0       "3" / "3"
     0-4    data  S R  1       9.00 B     9 B                 0       "4" / "4"
     0-5    data  S R  1       9.00 B     9 B                 0       "5" / "5"
     0-6    data  S R  1       9.00 B     9 B                 0       "6" / "6"
     0-7    data  S R  1       9.00 B     9 B                 0       "7" / "7"
   ```
   So the Python binding does indeed produce one page per row in the case above.
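
The same check can be done programmatically instead of by eye. The snippet below is only a minimal sketch: it parses the `parquet-cli pages` rows copied from the output above and assumes the column layout shown there (page id, page type, two encoding flags, then the value count). The helper name `data_page_counts` is made up for illustration and is not part of PyArrow or parquet-cli.

```python
# Rows copied from the `parquet-cli pages` output above (header lines omitted).
PAGES_OUTPUT = """\
0-D    dict  S _  7       8.00 B     56 B
0-1    data  S R  1       9.00 B     9 B                 0       "1" / "1"
0-2    data  S R  1       9.00 B     9 B                 0       "2" / "2"
0-3    data  S R  1       9.00 B     9 B                 0       "3" / "3"
0-4    data  S R  1       9.00 B     9 B                 0       "4" / "4"
0-5    data  S R  1       9.00 B     9 B                 0       "5" / "5"
0-6    data  S R  1       9.00 B     9 B                 0       "6" / "6"
0-7    data  S R  1       9.00 B     9 B                 0       "7" / "7"
"""


def data_page_counts(output):
    """Return the value count of every data page listed in the output.

    Hypothetical helper: assumes whitespace-separated fields in the order
    page id, page type, encoding flag, encoding flag, value count.
    """
    counts = []
    for line in output.splitlines():
        fields = line.split()
        # Skip the dictionary page and any line that is not a data page row.
        if len(fields) >= 5 and fields[1] == "data":
            counts.append(int(fields[4]))
    return counts


counts = data_page_counts(PAGES_OUTPUT)
print(len(counts), counts)
```

For the file written with `max_rows_per_page=1`, every data page should report a count of 1 and the number of data pages should equal the number of rows in the table (7 here).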


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
