jp0317 commented on code in PR #39818:
URL: https://github.com/apache/arrow/pull/39818#discussion_r1473489191
##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1478,16 +1480,29 @@ class TypedRecordReader : public
TypedColumnReaderImpl<DType>,
// We skipped the levels by incrementing 'levels_position_'. For values
// we do not have a buffer, so we need to read them and throw them away.
// First we need to figure out how many present/not-null values there are.
- std::shared_ptr<::arrow::ResizableBuffer> valid_bits;
- valid_bits = AllocateBuffer(this->pool_);
-
PARQUET_THROW_NOT_OK(valid_bits->Resize(bit_util::BytesForBits(skipped_records),
- /*shrink_to_fit=*/true));
+ int64_t buffer_size = bit_util::BytesForBits(skipped_records);
+ if (valid_bits_for_skip_ == nullptr) {
+ valid_bits_for_skip_ = AllocateBuffer(this->pool_);
+ }
+ if (buffer_size > valid_bits_for_skip_->size()) {
+ // Increase the bitmap size.
+ PARQUET_THROW_NOT_OK(valid_bits_for_skip_->Resize(
+ std::max<int64_t>(buffer_size, kMaxSkipLevelBufferSize),
+ /*shrink_to_fit=*/false));
+ }
ValidityBitmapInputOutput validity_io;
validity_io.values_read_upper_bound = skipped_records;
- validity_io.valid_bits = valid_bits->mutable_data();
+ validity_io.valid_bits = valid_bits_for_skip_->mutable_data();
validity_io.valid_bits_offset = 0;
DefLevelsToBitmap(def_levels() + start_levels_position, skipped_records,
this->leaf_info_, &validity_io);
+ // If the bitmap is too large, free it to avoid consuming too much memory
when there
Review Comment:
it's to keep the old behavior (freeing the bitmap). The # of allocs remains
the same, with the only difference is that resizing to 0 needs the alloc upon
next skip (if it happens) and resizing to 128B performs the allocation in the
end of the current skip (and incurs few cost for copying the 128B data). So i
think mapleFU's suggestion is better than the original implemention.
"Working in chunks" does sound even better (especially given that the
[ReadAndThrowAwayValues](https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1609)
is already doing it in chunks). But i'm not sure if "multiple chunks" can
happen in practice.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]