pitrou commented on code in PR #47758:
URL: https://github.com/apache/arrow/pull/47758#discussion_r2416690462


##########
python/pyarrow/parquet/core.py:
##########
@@ -265,6 +265,9 @@ class ParquetFile:
         If True, read Parquet logical types as Arrow extension types where 
possible,
         (e.g., read JSON as the canonical `arrow.json` extension type or UUID 
as
         the canonical `arrow.uuid` extension type).
+    max_page_header_size : int, default None
+        If not None, override the maximum size of a page header.
+        Deafults to 16MB, which should be sufficient for most Parquet files.

Review Comment:
   ```suggestion
           Defaults to 16MB, which should be sufficient for most Parquet files.
   ```



##########
cpp/src/parquet/properties.h:
##########
@@ -101,6 +104,11 @@ class PARQUET_EXPORT ReaderProperties {
   /// Set the size of the buffered stream buffer in bytes.
   void set_buffer_size(int64_t size) { buffer_size_ = size; }
 
+  /// Return the size of the buffered stream buffer. 0 means default

Review Comment:
   Please, can you make sure the docstrings actually describe the methods 
inside of blindly copy-pasting them?



##########
python/pyarrow/parquet/core.py:
##########
@@ -314,7 +317,8 @@ def __init__(self, source, *, metadata=None, 
common_metadata=None,
                  coerce_int96_timestamp_unit=None,
                  decryption_properties=None, thrift_string_size_limit=None,
                  thrift_container_size_limit=None, filesystem=None,
-                 page_checksum_verification=False, 
arrow_extensions_enabled=True):
+                 page_checksum_verification=False, 
arrow_extensions_enabled=True,
+                 max_page_header_size=None):

Review Comment:
   It's ok, but is the new argument also available on `pq.read_table` and 
friends?



##########
cpp/src/parquet/properties.h:
##########
@@ -101,6 +104,11 @@ class PARQUET_EXPORT ReaderProperties {
   /// Set the size of the buffered stream buffer in bytes.
   void set_buffer_size(int64_t size) { buffer_size_ = size; }
 
+  /// Return the size of the buffered stream buffer. 0 means default
+  uint32_t max_page_header_size() const { return max_page_header_size_; }
+  /// Set the size of the buffered stream buffer in bytes. 0 means default
+  void set_max_page_header_size(uint32_t size) { max_page_header_size_ = size; 
}

Review Comment:
   Can we change these to take `int32_t`? We try to avoid unsigned integers in 
public API.



##########
cpp/src/parquet/properties.h:
##########
@@ -57,6 +57,9 @@ enum class SizeStatisticsLevel : uint8_t {
   PageAndColumnChunk
 };
 
+// 16 MB is the default maximum page header size
+static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024;

Review Comment:
   Also let's make this `int32_t` not `uint32_t`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to