This is an automated email from the ASF dual-hosted git repository.

michaelsmith pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git
commit ab22511520f122db617bd08685d6fa11c4b36668
Author: Balazs Hevele <[email protected]>
AuthorDate: Thu Mar 19 13:42:39 2026 +0100

    IMPALA-12137: Fix skipping parquet data copy for dict pages

    This commit fixes the logic that skips copying data pages read from
    Parquet files when they use dictionary encoding. The previous commit
    checked the encoding through 'page_info->data_encoding', which is an
    output parameter that is only assigned at the end of the function, so
    the check read a value that had not been set yet.

    The optimization itself skips a memory allocation when reading Parquet
    files in a very specific case: uncompressed data pages of
    variable-length strings that use dictionary encoding. In this case
    there is no need to allocate a copy of the data buffer for the strings
    to point into, because they will point into the dictionary instead.

    Measurements: peak memory was measured with the following query:

      select count(distinct city) from functional_parquet.airports_parquet;

    Peak Memory of SCAN HDFS dropped from 425.75KB to 399.79KB.

    Change-Id: I3c6dfaeb5d2b7addbcd8ad663271131ec8608003
    Reviewed-on: http://gerrit.cloudera.org:8080/24117
    Reviewed-by: Impala Public Jenkins <[email protected]>
    Tested-by: Impala Public Jenkins <[email protected]>
---
 be/src/exec/parquet/parquet-column-chunk-reader.cc | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/be/src/exec/parquet/parquet-column-chunk-reader.cc b/be/src/exec/parquet/parquet-column-chunk-reader.cc
index 01a9a8c0c..23edffe54 100644
--- a/be/src/exec/parquet/parquet-column-chunk-reader.cc
+++ b/be/src/exec/parquet/parquet-column-chunk-reader.cc
@@ -344,6 +344,9 @@ Status ParquetColumnChunkReader::ReadDataPageData(DataPageInfo* page_info) {
   }
 
   const bool has_slot_desc = value_mem_type_ != ValueMemoryType::NO_SLOT_DESC;
+  const parquet::Encoding::type data_encoding = is_v2 ?
+      header.data_page_header_v2.encoding :
+      header.data_page_header.encoding;
 
   int data_size = uncompressed_size;
   uint8_t* data = nullptr;
@@ -385,7 +388,7 @@ Status ParquetColumnChunkReader::ReadDataPageData(DataPageInfo* page_info) {
     // If data page is dict encoded, strings will point to the dictionary instead of
     // the data buffer, so there is no need to make a copy of page data.
     const bool copy_buffer = (value_mem_type_ == ValueMemoryType::VAR_LEN_STR) &&
-        !IsDictionaryEncoding(page_info->data_encoding);
+        !IsDictionaryEncoding(data_encoding);
 
     if (copy_buffer) {
       // In this case returned batches will have pointers into the data page itself.
@@ -415,8 +418,7 @@ Status ParquetColumnChunkReader::ReadDataPageData(DataPageInfo* page_info) {
         &data, &data_size));
   }
 
-  page_info->data_encoding = is_v2 ? header.data_page_header_v2.encoding
-      : header.data_page_header.encoding;
+  page_info->data_encoding = data_encoding;
   page_info->data_ptr = data;
   page_info->data_size = data_size;
   page_info->is_valid = true;
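For context, the rationale in the commit message (strings decoded from a dictionary-encoded page point into the dictionary's buffer, so the transient page buffer holding only indices never needs to be copied) can be illustrated with a minimal, self-contained C++ sketch. This is not Impala code: `StringRef`, `Dictionary`, and `DecodeDictPage` are hypothetical stand-ins for Impala's `StringValue` and its real decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a non-owning string view (pointer + length).
struct StringRef {
  const char* ptr;
  std::size_t len;
};

// A dictionary whose storage outlives any single data page. Entries are kept
// as (offset, length) spans so that growing 'storage' never invalidates them.
struct Dictionary {
  std::string storage;                                  // concatenated bytes
  std::vector<std::pair<std::size_t, std::size_t>> spans;

  void Add(const std::string& s) {
    spans.push_back({storage.size(), s.size()});
    storage += s;
  }
  StringRef Get(uint32_t idx) const {
    return {storage.data() + spans[idx].first, spans[idx].second};
  }
};

// Decode a dict-encoded page: 'page_indices' models the transient page
// buffer, but every returned StringRef points into the dictionary, so the
// page buffer itself never needs to be copied or kept alive.
std::vector<StringRef> DecodeDictPage(const Dictionary& dict,
                                      const std::vector<uint32_t>& page_indices) {
  std::vector<StringRef> out;
  out.reserve(page_indices.size());
  for (uint32_t idx : page_indices) out.push_back(dict.Get(idx));
  return out;
}
```

A caller can decode a page of indices, discard the page buffer, and still read every string, because each `StringRef` aliases the dictionary's storage rather than the page.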
