jorisvandenbossche commented on code in PR #38360:
URL: https://github.com/apache/arrow/pull/38360#discussion_r1392194526
##########
python/pyarrow/_parquet.pyx:
##########
@@ -1703,6 +1708,13 @@ cdef shared_ptr[WriterProperties] _create_writer_properties(
# a size larger than this then it will be latched to this value.
props.max_row_group_length(_MAX_ROW_GROUP_SIZE)
+ # checksum
+
+ if page_checksum_enabled:
+ props.enable_page_checksum()
+ else:
+ props.disable_page_checksum()
Review Comment:
Small naming suggestion: in the Python API, several other keywords whose C++
counterparts use "enable" terminology are named with a "write_" or "use_"
prefix on the Python side. For example, "enable_statistics" on the C++ side
is "write_statistics" here.
So maybe we could also use `write_page_checksum` for the user-facing Python
keyword.
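To illustrate the convention, here is a minimal pure-Python sketch. `FakeWriterPropertiesBuilder` and this `create_writer_properties` function are hypothetical stand-ins, not the actual Cython code; only the `enable_*`/`disable_*` method names and the `write_statistics` / `write_page_checksum` keyword names come from the discussion above.

```python
class FakeWriterPropertiesBuilder:
    """Hypothetical stand-in for the C++ WriterProperties builder.

    It only records which builder methods were called.
    """

    def __init__(self):
        self.calls = []

    def enable_statistics(self):
        self.calls.append("enable_statistics")

    def disable_statistics(self):
        self.calls.append("disable_statistics")

    def enable_page_checksum(self):
        self.calls.append("enable_page_checksum")

    def disable_page_checksum(self):
        self.calls.append("disable_page_checksum")


def create_writer_properties(write_statistics=True, write_page_checksum=False):
    """Map Python-side "write_" keywords onto C++-side "enable"/"disable" calls."""
    props = FakeWriterPropertiesBuilder()

    if write_statistics:
        props.enable_statistics()
    else:
        props.disable_statistics()

    # Proposed keyword name: write_page_checksum (instead of page_checksum_enabled)
    if write_page_checksum:
        props.enable_page_checksum()
    else:
        props.disable_page_checksum()

    return props


props = create_writer_properties(write_page_checksum=True)
print(props.calls)  # ['enable_statistics', 'enable_page_checksum']
```

With this naming, the new option reads the same way as the existing `write_statistics` keyword at the call site.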
##########
python/pyarrow/parquet/core.py:
##########
@@ -887,6 +891,10 @@ def _sanitize_table(table, new_schema, flavor):
filtering more efficient than the page header, as it gathers all the
statistics for a Parquet file in a single place, avoiding scattered I/O.
Note that the page index is not yet used on the read size by PyArrow.
+page_checksum_enabled : bool, default False
+ Whether to write page checksums in general for all columns.
+ Page checksums enable detection of corruption, which might occur during
Review Comment:
```suggestion
Page checksums enable detection of data corruption, which might occur during
```
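As a rough illustration of what "detection of data corruption" means here, the sketch below computes a CRC32 over a page's bytes and checks it again on read. This is only a conceptual example using Python's `zlib`; the helper names `page_crc` and `verify_page` are hypothetical, and it is not how PyArrow or Parquet C++ implement the feature internally.

```python
import zlib


def page_crc(page_bytes: bytes) -> int:
    """Return an unsigned 32-bit CRC of the page bytes."""
    return zlib.crc32(page_bytes) & 0xFFFFFFFF


def verify_page(page_bytes: bytes, stored_crc: int) -> bool:
    """Return True if the page bytes still match the checksum stored at write time."""
    return page_crc(page_bytes) == stored_crc


# "Write" side: compute the checksum alongside the page.
page = b"example page payload"
stored = page_crc(page)

# Simulate corruption in transit or at rest by flipping a single bit.
corrupted = bytearray(page)
corrupted[3] ^= 0x01

print(verify_page(page, stored))              # True
print(verify_page(bytes(corrupted), stored))  # False
```

A reader that verifies checksums can then reject the corrupted page instead of silently decoding bad data.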