Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12621 )
Change subject: IMPALA-8253: Parquet delta encoding and decoding. ...................................................................... Patch Set 19: (8 comments) http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/benchmarks/parquet-delta-benchmark.cc File be/src/benchmarks/parquet-delta-benchmark.cc: http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/benchmarks/parquet-delta-benchmark.cc@1060 PS19, Line 1060: const std::vector<int> strides = {4, 8, 12, 16, 20, 30, 40, 50, 80, 100, 120, 150, 180, 200, 400}; nit: long line http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/benchmarks/parquet-delta-benchmark.cc@1062 PS19, Line 1062: /// This is used to tune which configurations should be measured. : auto filter = [] (const Config& config) -> bool { : return true : && config.parquet_type == Config::INT32 : && config.out_type == Config::int32 : // && (int) config.parquet_type == (int) config.out_type : && config.access == Config::BATCH : // && config.encoding == Config::PLAIN && config.mean_delta == 1 : && (config.stride == 4 /* || config.stride == 8 || config.stride == 20 || config.stride == 100 || config.stride == 400 */) : && config.mean_delta == 1 : ; What is the long term intention with this filter? It would be nice to find 8-16 combinations to run and insert the results to the beginning of this file http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-decoder.cc File be/src/exec/parquet/parquet-delta-decoder.cc: http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-decoder.cc@181 PS19, Line 181: num_buffered_values_ - next_buffered_value_index_; nit: +2 indentation http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-decoder.cc@298 PS19, Line 298: template int ParquetDeltaDecoder<int32_t>::NextValuesConverted<int8_t>(int num_values, > I'm not sure it is a good idea to explicitly instantiate NextValuesConverte We may be able to remove most int64_t ones, see my comment in bit-packing.cc http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-encoder.h File be/src/exec/parquet/parquet-delta-encoder.h: http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-encoder.h@54 PS19, Line 54: static constexpr int MAX_TOTAL_VALUE_COUNT = 16000; Note that there is a query option for setting max row count in page: parquet_page_row_count_limit http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-encoder.cc File be/src/exec/parquet/parquet-delta-encoder.cc: http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-encoder.cc@117 PS19, Line 117: std::memmove(new_data_start_address, old_data_start_address, data_len); Can you skip this if reserved_space_for_header_ == actual_header_size? Even if memmove would optimize this case I think that the code would be clearer. Alternatively we could avoid the copy and point to a bit later point, similarly to Arrow http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-encoder.cc@289 PS19, Line 289: miniblock_index * miniblock_size_in_values_; nit: indentation http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/util/bit-packing.cc File be/src/util/bit-packing.cc: http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/util/bit-packing.cc@75 PS19, Line 75: INSTANTIATE_UNPACK_AND_DELTA_DECODE(int8_t, int64_t); : INSTANTIATE_UNPACK_AND_DELTA_DECODE(int16_t, int64_t); : INSTANTIATE_UNPACK_AND_DELTA_DECODE(int32_t, int64_t); Do we need these combinations? We only seem to support BIGINT with int64 Parquet columns: https://github.com/apache/impala/blob/cf28a4c5292fdb3504d1fe11183c78243ed148a4/be/src/exec/parquet/parquet-metadata-utils.cc#L54 -- To view, visit http://gerrit.cloudera.org:8080/12621 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie7378ac1a490a6c89a0a4349aae86cbc0fbc80f8 Gerrit-Change-Number: 12621 Gerrit-PatchSet: 19 Gerrit-Owner: Daniel Becker <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Wed, 17 May 2023 17:21:19 +0000 Gerrit-HasComments: Yes
