mapleFU commented on code in PR #36290:
URL: https://github.com/apache/arrow/pull/36290#discussion_r1254499184


##########
python/pyarrow/parquet/core.py:
##########
@@ -867,6 +867,10 @@ def _sanitize_table(table, new_schema, flavor):
     it will restore the timezone (Parquet only stores the UTC values without
     timezone), or columns with duration type will be restored from the int64
     Parquet column.
+write_page_index : bool, default False

Review Comment:
   @jorisvandenbossche I've done that. By the way, pyarrow cannot currently
filter using the page index when reading, but readers such as `parquet-rs`
and `parquet-mr` can use it to optimize reads.



##########
python/pyarrow/tests/parquet/test_metadata.py:
##########
@@ -357,6 +357,20 @@ def test_field_id_metadata():
     assert schema[5].metadata[field_id] == b'-1000'
 
 
+def test_parquet_file_page_index():
+    table = pa.table({'a': [1, 2, 3]})
+
+    writer = pa.BufferOutputStream()
+    _write_table(table, writer, write_page_index=True)
+    reader = pa.BufferReader(writer.getvalue())
+
+    # Can retrieve the page index flags from the column chunk metadata
+    metadata = pq.read_metadata(reader)
+    cc = metadata.row_group(0).column(0)
+    assert cc.has_offset_index is True
+    assert cc.has_column_index is True

Review Comment:
   done


