jp0317 commented on code in PR #39818:
URL: https://github.com/apache/arrow/pull/39818#discussion_r1471755884


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -70,6 +70,8 @@ namespace {
 // The minimum number of repetition/definition levels to decode at a time, for
 // better vectorized performance when doing many smaller record reads
 constexpr int64_t kMinLevelBatchSize = 1024;
+// The maximum buffer size of the validity bitmap used when skipping buffered levels.
+constexpr int64_t kMaxSkipLevelBufferSize = 128;

Review Comment:
   yeah, for large batches the buffer is always expanded and then shrunk. 
Performance-wise that is probably fine (as shown in the 1/5000/1000000 test, 
where the batch size is 5k), since the original behavior was allocate-then-free. 
And when large batches occur there are correspondingly fewer rounds of 
reads/skips (e.g., for a 1M-row group there will be no more than 1K shrinks when 
the batch size is 1025). So even if a single skip incurs some overhead on the 
"shrink" part, the overall impact should be small. Memory-wise, that is why I 
set the cap to 128: to avoid wasting too much memory. Alternatively, we could 
set a global budget shared evenly across columns.
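The expand-then-shrink-to-a-cap pattern the reviewer describes can be sketched without Arrow's APIs. The class and method names below (`SkipBitmapScratch`, `Prepare`, `ShrinkToCap`) are hypothetical illustrations, not the PR's actual code; `std::vector` stands in for `arrow::ResizableBuffer`:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical cap, mirroring kMaxSkipLevelBufferSize in the PR.
constexpr int64_t kMaxSkipLevelBufferSize = 128;

// Reusable scratch bitmap: grow on demand for a large skip, then shrink
// back to the cap so a single big batch does not pin memory permanently.
class SkipBitmapScratch {
 public:
  // Returns a buffer of at least BytesForBits(num_levels) bytes, expanding
  // only when the current capacity is insufficient.
  uint8_t* Prepare(int64_t num_levels) {
    const int64_t needed = (num_levels + 7) / 8;  // bit_util::BytesForBits
    if (static_cast<int64_t>(buf_.size()) < needed) {
      buf_.resize(static_cast<size_t>(needed));
    }
    return buf_.data();
  }

  // Called after the skip: release memory beyond the cap.
  void ShrinkToCap() {
    if (static_cast<int64_t>(buf_.size()) > kMaxSkipLevelBufferSize) {
      buf_.resize(kMaxSkipLevelBufferSize);
      buf_.shrink_to_fit();
    }
  }

  int64_t capacity_bytes() const { return static_cast<int64_t>(buf_.size()); }

 private:
  // Preallocate the cap up front to avoid small repeated allocations.
  std::vector<uint8_t> buf_ = std::vector<uint8_t>(kMaxSkipLevelBufferSize);
};
```

With a 5000-level skip, `Prepare(5000)` grows the buffer to 625 bytes and `ShrinkToCap()` returns it to 128, matching the "fewer rounds of reads/skips means fewer shrinks" argument above.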



##########
cpp/src/parquet/column_reader.cc:
##########
@@ -1478,16 +1480,28 @@ class TypedRecordReader : public TypedColumnReaderImpl<DType>,
     // We skipped the levels by incrementing 'levels_position_'. For values
     // we do not have a buffer, so we need to read them and throw them away.
     // First we need to figure out how many present/not-null values there are.
-    std::shared_ptr<::arrow::ResizableBuffer> valid_bits;
-    valid_bits = AllocateBuffer(this->pool_);
-    PARQUET_THROW_NOT_OK(valid_bits->Resize(bit_util::BytesForBits(skipped_records),
-                                            /*shrink_to_fit=*/true));
+    int64_t buffer_size = bit_util::BytesForBits(skipped_records);
+    if (valid_bits_for_skip_ == nullptr) {
+      // Preallocating kMaxSkipLevelBufferSize bytes helps minimize allocations.
+      valid_bits_for_skip_ = AllocateBuffer(
+          this->pool_, std::max<int64_t>(buffer_size, kMaxSkipLevelBufferSize));
+    } else if (buffer_size > kMaxSkipLevelBufferSize) {
+      // Increase the bitmap size.
+      PARQUET_THROW_NOT_OK(valid_bits_for_skip_->Resize(buffer_size,
+                                                        /*shrink_to_fit=*/false));

Review Comment:
   the `shrink_to_fit` flag would be meaningless here, since the buffer is actually expanding?
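The reviewer's point can be illustrated with a toy model of resizable-buffer semantics. This is not Arrow's implementation, just a sketch of the behavior being questioned: a `shrink_to_fit` flag only matters when the new size is below the current capacity, so passing it on a pure expansion is a no-op:

```cpp
#include <cstdint>

// Toy model (not arrow::ResizableBuffer): capacity only shrinks when the
// caller both requests a smaller size and opts into shrink_to_fit.
struct ToyResizableBuffer {
  int64_t size = 0;
  int64_t capacity = 0;

  void Resize(int64_t new_size, bool shrink_to_fit = true) {
    if (new_size > capacity) {
      capacity = new_size;  // expansion: the flag is irrelevant here
    } else if (shrink_to_fit && new_size < capacity) {
      capacity = new_size;  // shrink only when explicitly requested
    }
    size = new_size;
  }
};
```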



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
