Re: [PR] GH-48334: Support reading encrypted bloom filters [arrow]

via GitHub Fri, 13 Mar 2026 10:57:58 -0700


fenfeng9 commented on code in PR #49334:
URL: https://github.com/apache/arrow/pull/49334#discussion_r2932896741



##########
cpp/src/parquet/bloom_filter.cc:
##########
@@ -106,7 +126,100 @@ static ::arrow::Status ValidateBloomFilterHeader(
 
 BlockSplitBloomFilter BlockSplitBloomFilter::Deserialize(
     const ReaderProperties& properties, ArrowInputStream* input,
-    std::optional<int64_t> bloom_filter_length) {
+    std::optional<int64_t> bloom_filter_length, Decryptor* header_decryptor,
+    Decryptor* bitset_decryptor) {
+  if (header_decryptor != nullptr || bitset_decryptor != nullptr) {
+    if (header_decryptor == nullptr || bitset_decryptor == nullptr) {
+      throw ParquetException(
+          "Bloom filter decryptors must be both provided or both null");
+    }
+
+    // Encrypted path: header and bitset are encrypted separately.
+    ThriftDeserializer deserializer(properties);
+    format::BloomFilterHeader header;
+
+    // Read the length-prefixed ciphertext for the header.
+    PARQUET_ASSIGN_OR_THROW(auto length_buf, input->Read(kCiphertextLengthSize));
+    if (ARROW_PREDICT_FALSE(length_buf->size() < kCiphertextLengthSize)) {
+      throw ParquetException("Bloom filter header read failed: not enough data");
+    }
+
+    const int64_t header_cipher_total_len =
+        ParseCiphertextTotalLength(length_buf->data(), length_buf->size());
+    if (ARROW_PREDICT_FALSE(header_cipher_total_len >
+                            std::numeric_limits<int32_t>::max())) {
+      throw ParquetException("Bloom filter header ciphertext length overflows int32");
+    }
+    if (bloom_filter_length && header_cipher_total_len > *bloom_filter_length) {
+      throw ParquetException(
+          "Bloom filter length less than encrypted bloom filter header length");
+    }
+    // Read the full header ciphertext and decrypt the Thrift header.
+    auto header_cipher_buf =
+        AllocateBuffer(properties.memory_pool(), header_cipher_total_len);
+    std::memcpy(header_cipher_buf->mutable_data(), length_buf->data(),
+                kCiphertextLengthSize);
+    const int64_t header_cipher_remaining =
+        header_cipher_total_len - kCiphertextLengthSize;
+    PARQUET_ASSIGN_OR_THROW(
+        auto read_size,
+        input->Read(header_cipher_remaining,

Review Comment:
   `Read(...)` doesn't guarantee a full read in one call:
     
https://github.com/apache/arrow/blob/a315b961cd6ab7b438d02a02f7aee3ff5c0c87c2/cpp/src/arrow/io/interfaces.h#L187-L194
    
   
   I checked the existing Parquet read paths, and they generally use a single `Read(...)` followed by a size check, for example in the existing unencrypted bloom filter path and in page reads:
   
   
https://github.com/apache/arrow/blob/a315b961cd6ab7b438d02a02f7aee3ff5c0c87c2/cpp/src/parquet/bloom_filter.cc#L125-L127
   
   
https://github.com/apache/arrow/blob/a315b961cd6ab7b438d02a02f7aee3ff5c0c87c2/cpp/src/parquet/bloom_filter.cc#L165-L169
   
   
https://github.com/apache/arrow/blob/a315b961cd6ab7b438d02a02f7aee3ff5c0c87c2/cpp/src/parquet/column_reader.cc#L456-L459
   
   
   The encrypted bloom filter path in this PR follows the same pattern. If you'd prefer, I can add a small local helper here that reads fully.
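
For reference, a rough sketch of what such a read-fully helper could look like. It is self-contained and does not use the Arrow API: `ChunkedStream` and `ReadFully` are hypothetical names invented for illustration, and the stream only imitates an input whose `Read` may return fewer bytes than requested. The real helper would need to work against `ArrowInputStream` and report errors through `ParquetException`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for an input stream (NOT the Arrow interface).
// Each Read() returns at most `max_chunk` bytes, to imitate short reads.
class ChunkedStream {
 public:
  ChunkedStream(std::vector<uint8_t> data, int64_t max_chunk)
      : data_(std::move(data)), max_chunk_(max_chunk) {}

  // Copies up to `nbytes` bytes into `out` and returns the number of bytes
  // actually copied. A return value of 0 means end of stream.
  int64_t Read(int64_t nbytes, uint8_t* out) {
    const int64_t available = static_cast<int64_t>(data_.size()) - pos_;
    const int64_t n = std::min({nbytes, max_chunk_, available});
    if (n > 0) {
      std::memcpy(out, data_.data() + pos_, static_cast<size_t>(n));
      pos_ += n;
    }
    return n;
  }

 private:
  std::vector<uint8_t> data_;
  int64_t max_chunk_;
  int64_t pos_ = 0;
};

// Hypothetical helper: keeps calling Read() until exactly `nbytes` bytes have
// been copied into `out`, and throws if the stream ends before that.
template <typename Stream>
void ReadFully(Stream* input, int64_t nbytes, uint8_t* out) {
  assert(nbytes >= 0);
  int64_t total = 0;
  while (total < nbytes) {
    const int64_t n = input->Read(nbytes - total, out + total);
    if (n <= 0) {
      throw std::runtime_error("unexpected end of stream");
    }
    total += n;
  }
}
```

With a helper like this, the "single `Read(...)` plus size check" pattern would be replaced by one call that either fills the buffer or fails. Whether that is worth doing only in the encrypted path, or consistently across the existing read paths, is a separate question.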
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
