[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

Csaba Ringhofer (Code Review) Sat, 10 Nov 2018 11:45:50 -0800

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )


Change subject: IMPALA-4123: Columnar decoding in Parquet
......................................................................


Patch Set 17:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@165
PS17, Line 165: numeric_limits<int32_t>::max();
Using such large values worries me a bit, it is easy to create an overflow with 
innocent looking code ( e.g. if (r.NextRepeatedRunLength() + buffered_values < 
output_buffer_size) ).

I would prefer to give it a valid value, like have a member with the number of 
values in the page.


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@648
PS17, Line 648: materializing
This seems a bit misleading to me: we do not have to materialize the batch to 
know num_def_levels_to_consume, just check some other conditions (mainly 
rep_levels_.CacheRemaining())/


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@667
PS17, Line 667: num_buffered_values_
num_buffered_values_ >= def_level_repeats should be always true at this point 
in a valid Parquet file, as num_buffered_values_ means the remaining (non NULL) 
values in the page, and def levels of the page should not contain information 
about rows outside the page, so I think that the handling of the 
num_buffered_values_ < def_level_repeats case could go to an UNLIKELY branch.


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@769
PS17, Line 769: if (UNLIKELY(!ConvertSlot(curr_val, 
tuple->GetSlot(tuple_offset_)))) {
Doing the conversion after the validation check means that timestamps that were 
moved out of range by the conversion do not set the NULL indicator.

This was not changed by your patch, but a TODO could be added.


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@834
PS17, Line 834:     StrideWriter<InternalType> out{out_vals, stride};
              :     // Hoist data_ into a local variable to prevent GCC from 
storing it every loop
              :     // iteration.
              :     uint8_t* data = data_;
              :     for (int64_t i = 0; i < count; ++i, out.SkipNext(1)) {
              :       if (UNLIKELY(!DecodeValue<Encoding::PLAIN>(&data, 
data_end_, out.current))) {
              :         return false;
              :       }
              :     }
              :     data_ = data;
I think that ParquetPlainEncoder would be a better place for this logic, e.g. 
with a int DecodeBatch(const uint8_t* buffer, const uint8_t* buffer_end, int 
fixed_len_size, int64_t stride,
 InternalType* v) functions that calls ParquetPlainEncoder::Decode and returns 
the total encoded_length.

I generally prefer to move logic out from parquet-column-readers.cc when it 
make sense, as this file already contains a lot of complex stuff, which makes 
it hard to get a "big picture" about parquet reading.


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@859
PS17, Line 859:   int pos_slot_offset = pos_slot_desc()->tuple_offset();
              :   for (int64_t i = 0; i < num_to_read; ++i, tuple_mem += 
tuple_size) {
              :     Tuple* tuple = reinterpret_cast<Tuple*>(tuple_mem);
              :     ReadPositionBatched(rep_levels_.CacheGetNext(),
              :         
reinterpret_cast<int64_t*>(tuple->GetSlot(pos_slot_offset)));
              :   }
This could be also simplified a bit by using stride writer.


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/util/mem-util.h
File be/src/util/mem-util.h:

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/util/mem-util.h@41
PS17, Line 41: // memcpy() is necessary because 'current' may not be aligned.
In which case can it point the unaligned memory? Normally it should point to 
slots in tuples, and those should have the proper alignment for the type, or at 
least that's what I thought until now.

Another reason for using memcpy instead of assignment could be that we are 
avoiding the copy constructor for types that have one.



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 17
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Sat, 10 Nov 2018 19:45:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

Reply via email to