Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 )
Change subject: IMPALA-4123 (prep): Parquet column reader cleanup ...................................................................... Patch Set 1: (6 comments) http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@518 PS1, Line 518: if (def_level < def_level_of_immediate_repeated_ancestor()) { : // A containing repeated field is empty or NULL. Skip the value but : // move to the next repetition level if necessary. : if (pos_slot_desc_ != nullptr) rep_levels_.CacheSkipLevels(1); : continue; : } : if (pos_slot_desc_ != nullptr) { : ReadPositionBatched(rep_levels_.CacheGetNext(), : static_cast<int64_t*>(tuple->GetSlot(pos_slot_desc_->tuple_offset()))); : } > I would move this whole block to ReadPositionBatched, and rename it to some I think the current structure is the lesser of evils for a couple of reads: * It is tied into the control flow of the loop, since we use "continue" to jump to the top of the loop based on the def level. I don't see a clean way to factor out the def_level < .. branch. * I want to use ReadPositionBatched() as a helper function in a follow-on patch. http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@531 PS1, Line 531: def_level >= max_def_level() > This was not changed, but I am curious: can def_level be greater than max_d It shouldn't be possible in a valid file. It would break our decoding in a lot of places, since we infer the maximum from the schema, then use that to determine the bit width when decoding levels. This predates me working on the code, but I suspect the looser check was used to avoid crashing without paying any extra runtime overhead to validate each value. http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@586 PS1, Line 586: if (!continue_execution) return false; > Replacing this with !parent_->parse_status_.ok() and removing continue_exec Done http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@597 PS1, Line 597: template <Encoding::type ENCODING> > This could be split to two specialized template functions. Ahh, I missed this. Unfortunately it looks like there's some reason that isn't not possible to specialize a member function template when the class template is unspecialised - I'm getting this from clang: template <typename InternalType, parquet::Type::type PARQUET_TYPE, bool MATERIALIZED> template <> bool ScalarColumnReader<InternalType, PARQUET_TYPE, MATERIALIZED>::DecodeValue<Encoding::PLAIN>( error| cannot specialize (with 'template<>') a member of an unspecialized template If there's an alternative you know of, I'm open to it. http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@600 PS1, Line 600: MemPool* RESTRICT pool > "pool" could be removed, as it is not used in the function (ReadSlot() used Oh good point. I removed the pool argument. I'll hold off on making the other change. I can see the argument for it, but one advantage of the current code is that it's more obvious in HdfsParquetScanner::AssembleRows() that ReadValueBatch() allocates memory from aux_mem_pool, so the memory lifetime is a bit clearer. http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@1320 PS1, Line 1320: return ReadSlot(static_cast<CollectionValue*>(tuple->GetSlot(tuple_offset_)), pool); > Tuple has a GetCollectionSlot() functions that does the cast. Some new type Ah I forgot about those functions, good catch. Added some new functions. I used GetBigIntVal() since I think the name should reflect the logical type of the slot. -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Tue, 27 Mar 2018 00:55:42 +0000 Gerrit-HasComments: Yes