fenfeng9 commented on code in PR #49334:
URL: https://github.com/apache/arrow/pull/49334#discussion_r2932896741
##########
cpp/src/parquet/bloom_filter.cc:
##########
@@ -106,7 +126,100 @@ static ::arrow::Status ValidateBloomFilterHeader(
BlockSplitBloomFilter BlockSplitBloomFilter::Deserialize(
const ReaderProperties& properties, ArrowInputStream* input,
- std::optional<int64_t> bloom_filter_length) {
+ std::optional<int64_t> bloom_filter_length, Decryptor* header_decryptor,
+ Decryptor* bitset_decryptor) {
+ if (header_decryptor != nullptr || bitset_decryptor != nullptr) {
+ if (header_decryptor == nullptr || bitset_decryptor == nullptr) {
+ throw ParquetException(
+ "Bloom filter decryptors must be both provided or both null");
+ }
+
+ // Encrypted path: header and bitset are encrypted separately.
+ ThriftDeserializer deserializer(properties);
+ format::BloomFilterHeader header;
+
+ // Read the length-prefixed ciphertext for the header.
+ PARQUET_ASSIGN_OR_THROW(auto length_buf, input->Read(kCiphertextLengthSize));
+ if (ARROW_PREDICT_FALSE(length_buf->size() < kCiphertextLengthSize)) {
+ throw ParquetException("Bloom filter header read failed: not enough data");
+ }
+
+ const int64_t header_cipher_total_len =
+ ParseCiphertextTotalLength(length_buf->data(), length_buf->size());
+ if (ARROW_PREDICT_FALSE(header_cipher_total_len >
+ std::numeric_limits<int32_t>::max())) {
+ throw ParquetException("Bloom filter header ciphertext length overflows int32");
+ }
+ if (bloom_filter_length && header_cipher_total_len > *bloom_filter_length) {
+ throw ParquetException(
+ "Bloom filter length less than encrypted bloom filter header length");
+ }
+ // Read the full header ciphertext and decrypt the Thrift header.
+ auto header_cipher_buf =
+ AllocateBuffer(properties.memory_pool(), header_cipher_total_len);
+ std::memcpy(header_cipher_buf->mutable_data(), length_buf->data(),
+ kCiphertextLengthSize);
+ const int64_t header_cipher_remaining =
+ header_cipher_total_len - kCiphertextLengthSize;
+ PARQUET_ASSIGN_OR_THROW(
+ auto read_size,
+ input->Read(header_cipher_remaining,
Review Comment:
`Read(...)` doesn't guarantee a full read in one call:
https://github.com/apache/arrow/blob/a315b961cd6ab7b438d02a02f7aee3ff5c0c87c2/cpp/src/arrow/io/interfaces.h#L187-L194
I checked the existing Parquet read paths, and they generally use a single
`Read(...)` followed by a size check, for example in the existing unencrypted
bloom filter path and in page reads:
https://github.com/apache/arrow/blob/a315b961cd6ab7b438d02a02f7aee3ff5c0c87c2/cpp/src/parquet/bloom_filter.cc#L125-L127
https://github.com/apache/arrow/blob/a315b961cd6ab7b438d02a02f7aee3ff5c0c87c2/cpp/src/parquet/bloom_filter.cc#L165-L169
https://github.com/apache/arrow/blob/a315b961cd6ab7b438d02a02f7aee3ff5c0c87c2/cpp/src/parquet/column_reader.cc#L456-L459
This encrypted bloom filter path follows the same pattern as well. If you'd
prefer, I can add a small local helper here to read fully.
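
For reference, a rough sketch of what such a read-fully helper could look like. This is only an illustration under assumptions: `ShortReadStream` is a hypothetical stand-in for a stream whose `Read` may return fewer bytes than requested (it is not the Arrow `InputStream` class), and `ReadFully` is a hypothetical name, not an existing Arrow or Parquet function. A real helper would work with the stream's `Result`-returning `Read` and raise a `ParquetException` instead.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in stream: Read() hands back at most max_chunk bytes per
// call, which simulates a short read. Not the Arrow InputStream API.
class ShortReadStream {
 public:
  ShortReadStream(std::vector<uint8_t> data, int64_t max_chunk)
      : data_(std::move(data)), max_chunk_(max_chunk) {}

  // Copies up to nbytes into out and returns the number of bytes copied.
  // A return value of 0 means end of stream.
  int64_t Read(int64_t nbytes, uint8_t* out) {
    const int64_t available = static_cast<int64_t>(data_.size()) - pos_;
    const int64_t n = std::min({nbytes, max_chunk_, available});
    if (n <= 0) return 0;
    std::memcpy(out, data_.data() + pos_, static_cast<size_t>(n));
    pos_ += n;
    return n;
  }

 private:
  std::vector<uint8_t> data_;
  int64_t max_chunk_;
  int64_t pos_ = 0;
};

// Hypothetical helper: keeps calling Read() until exactly nbytes have been
// copied into out, and throws if the stream ends first.
template <typename Stream>
void ReadFully(Stream& input, int64_t nbytes, uint8_t* out) {
  int64_t total = 0;
  while (total < nbytes) {
    const int64_t n = input.Read(nbytes - total, out + total);
    if (n <= 0) {
      throw std::runtime_error("unexpected end of stream: not enough data");
    }
    total += n;
  }
}
```

With a helper along these lines, the size check after each read would turn into a single error path inside the loop, at the cost of diverging from the single-`Read(...)` pattern used in the other Parquet read paths linked above.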