Tim Armstrong has posted comments on this change. Change subject: Optimized ReadValueBatch() for Parquet scalar column readers. ......................................................................
Patch Set 3: (17 comments) I did a more detailed pass with an eye to readability. I think it's worth getting this code as readable as possible, since it's a lot of work to get familiar with it. http://gerrit.cloudera.org:8080/#/c/2843/3/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 264: DCHECK(parent_ != NULL); I think it'd be better not to pass the parent in here - better to keep the dependency from HdfsParquetScanner to LevelDecoder one-way rather than having them too tangled up. If you keep the mempool, maybe just pass the MemPool in. Line 281: inline bool CacheHasNext() const { return cached_level_idx_ != num_cached_levels_; } Nit: cached_level_idx_ < num_cached_levels_ is more obvious to me, but this is fine. Line 791: necessary If they are empty? Line 797: // We only need the repetition levels for populating the position slot. Why is this true? Because we're materializing the top-level tuple? Line 826: ReadValueBatch I think it's confusing having the same name as above, since it doesn't actualyl do the same thing. Maybe ReadValueBatchInternal(). Or MaterializeValueBatch(). Line 1505: filename_ = filename.c_str(); It's surprising that this holds a reference to the memory of a const& argument. Either document that, or just make filename_ a std::string so that it's copied. The variable lifetime works ok now (I think) but better to avoid the issue. Line 1548: cached_levels_ = reinterpret_cast<uint8_t*>( The more I think about it the less I think it's worth complicating the code to track this until we have an overall solution to untracked memory. This memory is it's going to be a small factor of the overall column memory usage (much smaller than the buffers) and a small factor of untracked memory usage (thrift and llvm are worse offenders). The untracked memory is only really a big problem when it's a significant fraction of overall memory. Line 1573: if We're optimising for the non-nullable case here, but not in the value materialization. Doesn't have to be this change, but would it make sense to optimise for it in both places? I guess this avoids extra template args. Line 1580: if (UNLIKELY(num_cached_levels_ == 0)) return; This seems overly complicated: why have the DCHECK below if we have to add extra code to avoid tripping it. Line 1594: if (repeat_count_ == 0 && literal_count_ == 0) { This is a little confusing since this branch is always taken on iterations after the first, but it reads like this could be taken on any iteration. Consider making this aspect of the control flow more obvious at the cost of some minor code duplication. E.g if (repeat_count_ == 0 && literal_count_ == 0) { if (!NextCounts()) return num_values; } while (...) { if (repeat_count_ > 0 && current_value_ > max_level_) return -1; .... if (!NextCounts()) return num_values; } Line 1598: -1 Not your change, but maybe make this a constant instead of a magic number? Line 1603: min Could make this more concise using min<int64_t>(). Does this need to be an int64 anyway? Line 1603: num_repeat_values Variable name is a little confusing. repeat_count_to_set? num_to_set? Line 1610: while (literal_count_ > 0 && num_values < batch_size) { For symmetry, could do the same min() calculation as num_repeat_values and convert this to a for loop. Line 1612: DCHECK(result); Could invalid input data trip this DCHECK? Line 1620: while (num_values < batch_size) { The termination condition would be more obvious if this was a for() loop. I.e. for (; num_values < batch_size; ++num_values) Line 1658: -1 Should we use a similar constant for def_level_ -1. E.g. INVALID_DEF_LEVEL? -- To view, visit http://gerrit.cloudera.org:8080/2843 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I21fa9b050a45f2dd45cc0091ea5b008d3c0a3f30 Gerrit-PatchSet: 3 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Alex Behm <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
