This is an automated email from the ASF dual-hosted git repository.

michaelsmith pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git
commit ab22511520f122db617bd08685d6fa11c4b36668
Author: Balazs Hevele <[email protected]>
AuthorDate: Thu Mar 19 13:42:39 2026 +0100

    IMPALA-12137: Fix skipping parquet data copy for dict pages

    This commit fixes the logic that skips copying data pages read from
    Parquet files when they use dictionary encoding. The previous commit
    checked the encoding through 'page_info->data_encoding', which is an
    output parameter that is only assigned at the end of the function, so
    the check read a value that had not been set yet.

    The optimization itself skips a memory allocation when reading Parquet
    files in a very specific case: uncompressed data pages of
    variable-length strings that use dictionary encoding. In this case
    there is no need to allocate a copy of the data buffer for the strings
    to point into, because they will point into the dictionary instead.

    Measurements: peak memory was measured with the following query:

      select count(distinct city) from functional_parquet.airports_parquet;

    Peak Memory of SCAN HDFS dropped from 425.75KB to 399.79KB.

    Change-Id: I3c6dfaeb5d2b7addbcd8ad663271131ec8608003
    Reviewed-on: http://gerrit.cloudera.org:8080/24117
    Reviewed-by: Impala Public Jenkins <[email protected]>
    Tested-by: Impala Public Jenkins <[email protected]>
---
 be/src/exec/parquet/parquet-column-chunk-reader.cc | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/be/src/exec/parquet/parquet-column-chunk-reader.cc b/be/src/exec/parquet/parquet-column-chunk-reader.cc
index 01a9a8c0c..23edffe54 100644
--- a/be/src/exec/parquet/parquet-column-chunk-reader.cc
+++ b/be/src/exec/parquet/parquet-column-chunk-reader.cc
@@ -344,6 +344,9 @@ Status ParquetColumnChunkReader::ReadDataPageData(DataPageInfo* page_info) {
   }
 
   const bool has_slot_desc = value_mem_type_ != ValueMemoryType::NO_SLOT_DESC;
+  const parquet::Encoding::type data_encoding = is_v2 ?
+      header.data_page_header_v2.encoding :
+      header.data_page_header.encoding;
 
   int data_size = uncompressed_size;
   uint8_t* data = nullptr;
@@ -385,7 +388,7 @@ Status ParquetColumnChunkReader::ReadDataPageData(DataPageInfo* page_info) {
     // If data page is dict encoded, strings will point to the dictionary instead of
     // the data buffer, so there is no need to make a copy of page data.
     const bool copy_buffer = (value_mem_type_ == ValueMemoryType::VAR_LEN_STR) &&
-        !IsDictionaryEncoding(page_info->data_encoding);
+        !IsDictionaryEncoding(data_encoding);
 
     if (copy_buffer) {
       // In this case returned batches will have pointers into the data page itself.
@@ -415,8 +418,7 @@ Status ParquetColumnChunkReader::ReadDataPageData(DataPageInfo* page_info) {
         &data, &data_size));
   }
 
-  page_info->data_encoding = is_v2 ? header.data_page_header_v2.encoding
-      : header.data_page_header.encoding;
+  page_info->data_encoding = data_encoding;
   page_info->data_ptr = data;
   page_info->data_size = data_size;
   page_info->is_valid = true;
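For context, the rationale in the commit message (strings decoded from a dictionary-encoded page point into the dictionary's buffer, so the transient page buffer holding only indices never needs to be copied) can be illustrated with a minimal, self-contained C++ sketch. This is not Impala code: `StringRef`, `Dictionary`, and `DecodeDictPage` are hypothetical stand-ins for Impala's `StringValue` and its real decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a non-owning string view (pointer + length).
struct StringRef {
  const char* ptr;
  std::size_t len;
};

// A dictionary whose storage outlives any single data page. Entries are kept
// as (offset, length) spans so that growing 'storage' never invalidates them.
struct Dictionary {
  std::string storage;                                  // concatenated bytes
  std::vector<std::pair<std::size_t, std::size_t>> spans;

  void Add(const std::string& s) {
    spans.push_back({storage.size(), s.size()});
    storage += s;
  }
  StringRef Get(uint32_t idx) const {
    return {storage.data() + spans[idx].first, spans[idx].second};
  }
};

// Decode a dict-encoded page: 'page_indices' models the transient page
// buffer, but every returned StringRef points into the dictionary, so the
// page buffer itself never needs to be copied or kept alive.
std::vector<StringRef> DecodeDictPage(const Dictionary& dict,
                                      const std::vector<uint32_t>& page_indices) {
  std::vector<StringRef> out;
  out.reserve(page_indices.size());
  for (uint32_t idx : page_indices) out.push_back(dict.Get(idx));
  return out;
}
```

A caller can decode a page of indices, discard the page buffer, and still read every string, because each `StringRef` aliases the dictionary's storage rather than the page.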
