Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/8319 )
Change subject: IMPALA-4123: Columnar decoding in Parquet ...................................................................... Patch Set 17: (7 comments) http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@165 PS17, Line 165: numeric_limits<int32_t>::max(); Using such large values worries me a bit, it is easy to create an overflow with innocent looking code ( e.g. if (r.NextRepeatedRunLength() + buffered_values < output_buffer_size) ). I would prefer to give it a valid value, like have a member with the number of values in the page. http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@648 PS17, Line 648: materializing This seems a bit misleading to me: we do not have to materialize the batch to know num_def_levels_to_consume, just check some other conditions (mainly rep_levels_.CacheRemaining())/ http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@667 PS17, Line 667: num_buffered_values_ num_buffered_values_ >= def_level_repeats should be always true at this point in a valid Parquet file, as num_buffered_values_ means the remaining (non NULL) values in the page, and def levels of the page should not contain information about rows outside the page, so I think that the handling of the num_buffered_values_ < def_level_repeats case could go to an UNLIKELY branch. http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@769 PS17, Line 769: if (UNLIKELY(!ConvertSlot(curr_val, tuple->GetSlot(tuple_offset_)))) { Doing the conversion after the validation check means that timestamps that were moved out of range by the conversion do not set the NULL indicator. This was not changed by your patch, but a TODO could be added. http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@834 PS17, Line 834: StrideWriter<InternalType> out{out_vals, stride}; : // Hoist data_ into a local variable to prevent GCC from storing it every loop : // iteration. : uint8_t* data = data_; : for (int64_t i = 0; i < count; ++i, out.SkipNext(1)) { : if (UNLIKELY(!DecodeValue<Encoding::PLAIN>(&data, data_end_, out.current))) { : return false; : } : } : data_ = data; I think that ParquetPlainEncoder would be a better place for this logic, e.g. with a int DecodeBatch(const uint8_t* buffer, const uint8_t* buffer_end, int fixed_len_size, int64_t stride, InternalType* v) functions that calls ParquetPlainEncoder::Decode and returns the total encoded_length. I generally prefer to move logic out from parquet-column-readers.cc when it make sense, as this file already contains a lot of complex stuff, which makes it hard to get a "big picture" about parquet reading. http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@859 PS17, Line 859: int pos_slot_offset = pos_slot_desc()->tuple_offset(); : for (int64_t i = 0; i < num_to_read; ++i, tuple_mem += tuple_size) { : Tuple* tuple = reinterpret_cast<Tuple*>(tuple_mem); : ReadPositionBatched(rep_levels_.CacheGetNext(), : reinterpret_cast<int64_t*>(tuple->GetSlot(pos_slot_offset))); : } This could be also simplified a bit by using stride writer. http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/util/mem-util.h File be/src/util/mem-util.h: http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/util/mem-util.h@41 PS17, Line 41: // memcpy() is necessary because 'current' may not be aligned. In which case can it point the unaligned memory? Normally it should point to slots in tuples, and those should have the proper alignment for the type, or at least that's what I thought until now. Another reason for using memcpy instead of assignment could be that we are avoiding the copy constructor for types that have one. -- To view, visit http://gerrit.cloudera.org:8080/8319 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef Gerrit-Change-Number: 8319 Gerrit-PatchSet: 17 Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Sat, 10 Nov 2018 19:45:09 +0000 Gerrit-HasComments: Yes