Tim Armstrong has posted comments on this change.

Change subject: Optimized ReadValueBatch() for Parquet scalar column readers.
......................................................................


Patch Set 3:

(17 comments)

I did a more detailed pass with an eye to readability. I think it's worth 
getting this code as readable as possible, since it's a lot of work to get 
familiar with it.

http://gerrit.cloudera.org:8080/#/c/2843/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 264:     DCHECK(parent_ != NULL);
I think it'd be better not to pass the parent in here - better to keep the 
dependency from HdfsParquetScanner to LevelDecoder one-way rather than having 
them too tangled up. If you keep the mempool, maybe just pass the MemPool in.


Line 281:   inline bool CacheHasNext() const { return cached_level_idx_ != 
num_cached_levels_; }
Nit: cached_level_idx_ < num_cached_levels_ is more obvious to me, but this is 
fine.


Line 791: necessary
If they are empty?


Line 797:       // We only need the repetition levels for populating the 
position slot.
Why is this true? Because we're materializing the top-level tuple?


Line 826: ReadValueBatch
I think it's confusing having the same name as above, since it doesn't actualyl 
do the same thing. Maybe ReadValueBatchInternal(). Or MaterializeValueBatch().


Line 1505:   filename_ = filename.c_str();
It's surprising that this holds a reference to the memory of a const& argument. 
Either document that, or just make filename_ a std::string so that it's copied.

The variable lifetime works ok now (I think) but better to avoid the issue.


Line 1548:   cached_levels_ = reinterpret_cast<uint8_t*>(
The more I think about it the less I think it's worth complicating the code to 
track this until we have an overall solution to untracked memory. This memory 
is it's going to be a small factor of the overall column memory usage (much 
smaller than the buffers) and a small factor of untracked memory usage (thrift 
and llvm are worse offenders).  The untracked memory is only really a big 
problem when it's a significant fraction of overall memory.


Line 1573: if
We're optimising for the non-nullable case here, but not in the value 
materialization. Doesn't have to be this change, but would it make sense to 
optimise for it in both places?

I guess this avoids extra template args.


Line 1580:     if (UNLIKELY(num_cached_levels_ == 0)) return;
This seems overly complicated: why have the DCHECK below if we have to add 
extra code to avoid tripping it.


Line 1594:       if (repeat_count_ == 0 && literal_count_ == 0) {
This is a little confusing since this branch is always taken on  iterations 
after the first, but it reads like this could be taken on any iteration. 
Consider making this aspect of the control flow more obvious at the cost of 
some minor code duplication.

E.g

if (repeat_count_ == 0 && literal_count_ == 0) {
  if (!NextCounts()) return num_values;
}

while (...) {
  if (repeat_count_ > 0 && current_value_ > max_level_) return -1;
  ....

  if (!NextCounts()) return num_values;
}


Line 1598:  -1
Not your change, but maybe make this a constant instead of a magic number?


Line 1603: min
Could make this more concise using min<int64_t>(). Does this need to be an 
int64 anyway?


Line 1603: num_repeat_values
Variable name is a little confusing. repeat_count_to_set? num_to_set?


Line 1610:       while (literal_count_ > 0 && num_values < batch_size) {
For symmetry, could do the same min() calculation as num_repeat_values and 
convert this to a for loop.


Line 1612:         DCHECK(result);
Could invalid input data trip this DCHECK?


Line 1620:     while (num_values < batch_size) {
The termination condition would be more obvious if this was a for() loop. I.e.

for (; num_values < batch_size; ++num_values)


Line 1658: -1
Should we use a similar constant for def_level_ -1. E.g. INVALID_DEF_LEVEL?


-- 
To view, visit http://gerrit.cloudera.org:8080/2843
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21fa9b050a45f2dd45cc0091ea5b008d3c0a3f30
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to