Alex Behm has posted comments on this change. Change subject: Optimized ReadValueBatch() for Parquet scalar column readers. ......................................................................
Patch Set 3: (17 comments) http://gerrit.cloudera.org:8080/#/c/2843/3/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 264: DCHECK(parent_ != NULL); > I think it'd be better not to pass the parent in here - better to keep the Done Line 281: inline bool CacheHasNext() const { return cached_level_idx_ != num_cached_levels_; } > Nit: cached_level_idx_ < num_cached_levels_ is more obvious to me, but this Done Line 791: // Fill def/rep level caches if necessary. > If they are empty? Done Line 797: // We only need the repetition levels for populating the position slot. > Why is this true? Because we're materializing the top-level tuple? That's correct. Added comment. Line 826: bool ReadValueBatch(MemPool* pool, int max_values, int tuple_size, > I think it's confusing having the same name as above, since it doesn't actu Agreed, my creativity failed me :) Using MaterializeValueBatch() Line 1505: filename_ = filename.c_str(); > It's surprising that this holds a reference to the memory of a const& argum Done Line 1548: cached_levels_ = reinterpret_cast<uint8_t*>( > The more I think about it the less I think it's worth complicating the code Happy to change it, but I really want to we sure we are not adding risk here so let's get a second opinion (Marcel, Dan?) Line 1573: if (max_level_ > 0) { > We're optimising for the non-nullable case here, but not in the value mater This is mostly for simplicity and correctness because in the non-nullable case there is no def level data to read. In practice, most of the data is nullable because you can only create non-nullable data outside of Hive/Impala, so let's optimize the value materialziation later. Line 1580: if (UNLIKELY(num_cached_levels_ == 0)) return; > This seems overly complicated: why have the DCHECK below if we have to add Done Line 1594: if (repeat_count_ == 0 && literal_count_ == 0) { > This is a little confusing since this branch is always taken on iterations Restructured as discussed. Line 1598: return -1; > Not your change, but maybe make this a constant instead of a magic number? Done Line 1603: int64_t num_repeat_values = min(static_cast<int64_t>(repeat_count_), > Could make this more concise using min<int64_t>(). Does this need to be an Done. Using uint32_t because that's the type of repeat_count_ Line 1603: int64_t num_repeat_values = min(static_cast<int64_t>(repeat_count_), > Variable name is a little confusing. repeat_count_to_set? num_to_set? Done Line 1610: while (literal_count_ > 0 && num_values < batch_size) { > For symmetry, could do the same min() calculation as num_repeat_values and Done Line 1612: DCHECK(result); > Could invalid input data trip this DCHECK? I think so. Removed DCHECK and added error check. Line 1620: while (num_values < batch_size) { > The termination condition would be more obvious if this was a for() loop. I Done Line 1658: def_level_ = -1; > Should we use a similar constant for def_level_ -1. E.g. INVALID_DEF_LEVEL? Done -- To view, visit http://gerrit.cloudera.org:8080/2843 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I21fa9b050a45f2dd45cc0091ea5b008d3c0a3f30 Gerrit-PatchSet: 3 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Alex Behm <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
