Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
......................................................................


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@518
PS1, Line 518:       if (def_level < 
def_level_of_immediate_repeated_ancestor()) {
             :         // A containing repeated field is empty or NULL. Skip 
the value but
             :         // move to the next repetition level if necessary.
             :         if (pos_slot_desc_ != nullptr) 
rep_levels_.CacheSkipLevels(1);
             :         continue;
             :       }
             :       if (pos_slot_desc_ != nullptr) {
             :         ReadPositionBatched(rep_levels_.CacheGetNext(),
             :             
static_cast<int64_t*>(tuple->GetSlot(pos_slot_desc_->tuple_offset())));
             :       }
> I would move this whole block to ReadPositionBatched, and rename it to some
I think the current structure is the lesser of evils for a couple of reads:
* It is tied into the control flow of the loop, since we use "continue" to jump 
to the top of the loop based on the def level. I don't see a clean way to 
factor out the def_level < .. branch.
* I want to use ReadPositionBatched() as a helper function in a follow-on patch.


http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@531
PS1, Line 531: def_level >= max_def_level()
> This was not changed, but I am curious: can def_level be greater than max_d
It shouldn't be possible in a valid file. It would break our decoding in a lot 
of places, since we infer the maximum from the schema, then use that to 
determine the bit width when decoding levels.

This predates me working on the code, but I suspect the looser check was used 
to avoid crashing without paying any extra runtime overhead to validate each 
value.


http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@586
PS1, Line 586:     if (!continue_execution) return false;
> Replacing this with !parent_->parse_status_.ok() and removing continue_exec
Done


http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@597
PS1, Line 597: template <Encoding::type ENCODING>
> This could be split to two specialized template functions.
Ahh, I missed this. Unfortunately it looks like there's some reason that isn't 
not possible to specialize a member function template when the class template 
is unspecialised - I'm getting this from clang:

  template <typename InternalType, parquet::Type::type PARQUET_TYPE, bool 
MATERIALIZED>
  template <>
  bool ScalarColumnReader<InternalType, PARQUET_TYPE, 
MATERIALIZED>::DecodeValue<Encoding::PLAIN>(

 error| cannot specialize (with 'template<>') a member of an unspecialized 
template


If there's an alternative you know of, I'm open to it.


http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@600
PS1, Line 600: MemPool* RESTRICT pool
> "pool" could be removed, as it is not used in the function (ReadSlot() used
Oh good point. I removed the pool argument.

I'll hold off on making the other change. I can see the argument for it, but 
one advantage of the current code is that it's more obvious in 
HdfsParquetScanner::AssembleRows() that ReadValueBatch() allocates memory from 
aux_mem_pool, so the memory lifetime is a bit clearer.


http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@1320
PS1, Line 1320:     return 
ReadSlot(static_cast<CollectionValue*>(tuple->GetSlot(tuple_offset_)), pool);
> Tuple has a GetCollectionSlot() functions that does the cast. Some new type
Ah I forgot about those functions, good catch. Added some new functions. I used 
GetBigIntVal() since I think the name should reflect the logical type of the 
slot.



--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Tue, 27 Mar 2018 00:55:42 +0000
Gerrit-HasComments: Yes

Reply via email to