Alex Behm has posted comments on this change.

Change subject: Optimized ReadValueBatch() for Parquet scalar column readers.
......................................................................


Patch Set 3:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/2843/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 264:     DCHECK(parent_ != NULL);
> I think it'd be better not to pass the parent in here - better to keep the 
Done


Line 281:   inline bool CacheHasNext() const { return cached_level_idx_ != num_cached_levels_; }
> Nit: cached_level_idx_ < num_cached_levels_ is more obvious to me, but this
Done


Line 791:       // Fill def/rep level caches if necessary.
> If they are empty?
Done


Line 797:       // We only need the repetition levels for populating the position slot.
> Why is this true? Because we're materializing the top-level tuple?
That's correct. Added comment.
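To make the point about the position slot concrete, here is a hedged sketch (illustrative names, not Impala's actual code) of how repetition levels can drive a position counter: a rep level below the collection's level starts a new collection, so the counter resets; otherwise it increments.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch: compute the position of each value within its
// enclosing collection from Parquet repetition levels. A rep level below
// collection_rep_level means a new collection starts at this value.
std::vector<int64_t> ComputePositions(const std::vector<int>& rep_levels,
                                      int collection_rep_level) {
  std::vector<int64_t> pos;
  int64_t next = 0;
  for (int r : rep_levels) {
    if (r < collection_rep_level) next = 0;  // new collection begins
    pos.push_back(next++);
  }
  return pos;
}
```

When only the top-level tuple is materialized, no position slot is needed, which is why the repetition levels matter here only for this purpose.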


Line 826:   bool ReadValueBatch(MemPool* pool, int max_values, int tuple_size,
> I think it's confusing having the same name as above, since it doesn't actu
Agreed, my creativity failed me :)

Using MaterializeValueBatch()


Line 1505:   filename_ = filename.c_str();
> It's surprising that this holds a reference to the memory of a const& argum
Done


Line 1548:   cached_levels_ = reinterpret_cast<uint8_t*>(
> The more I think about it the less I think it's worth complicating the code
Happy to change it, but I really want to be sure we are not adding risk here, 
so let's get a second opinion (Marcel, Dan?)


Line 1573:   if (max_level_ > 0) {
> We're optimising for the non-nullable case here, but not in the value mater
This is mostly for simplicity and correctness: in the non-nullable case there 
is no def-level data to read at all. In practice, most data is nullable 
because non-nullable data can only be created outside of Hive/Impala, so let's 
optimize the value materialization later.
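A minimal sketch of the distinction being discussed (hypothetical names, not Impala's actual API): when the max definition level is 0 the column is required, Parquet writes no def-level data, and every value is present, so the decoder can skip the def-level read entirely.

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch: count non-NULL values for a column. For a required
// (non-nullable) column, max_def_level == 0 and no def levels exist, so
// all num_values values are present. Otherwise a value is present iff its
// def level equals max_def_level.
int CountNonNull(int max_def_level, const std::vector<int>& def_levels,
                 int num_values) {
  if (max_def_level == 0) {
    // Non-nullable column: nothing to decode, fast path.
    return num_values;
  }
  int non_null = 0;
  for (int i = 0; i < num_values; ++i) {
    if (def_levels[i] == max_def_level) ++non_null;
  }
  return non_null;
}
```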


Line 1580:     if (UNLIKELY(num_cached_levels_ == 0)) return;
> This seems overly complicated: why have the DCHECK below if we have to add 
Done


Line 1594:       if (repeat_count_ == 0 && literal_count_ == 0) {
> This is a little confusing since this branch is always taken on  iterations
Restructured as discussed.


Line 1598:           return -1;
> Not your change, but maybe make this a constant instead of a magic number?
Done


Line 1603:       int64_t num_repeat_values = min(static_cast<int64_t>(repeat_count_),
> Could make this more concise using min<int64_t>(). Does this need to be an 
Done. Using uint32_t because that's the type of repeat_count_


Line 1603:       int64_t num_repeat_values = min(static_cast<int64_t>(repeat_count_),
> Variable name is a little confusing. repeat_count_to_set? num_to_set?
Done


Line 1610:       while (literal_count_ > 0 && num_values < batch_size) {
> For symmetry, could do the same min() calculation as num_repeat_values and 
Done
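For reference, the min() calculation being applied to both run types looks roughly like this hedged sketch (illustrative names): clamp the run length to the space left in the batch so neither the run nor the batch is overrun.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative sketch: how many values of an RLE run (repeated or
// literal) to consume, given how many values the batch already holds.
uint32_t NumToSet(uint32_t run_count, int num_values, int batch_size) {
  // Never emit more values than the run holds or the batch can accept.
  return std::min<uint32_t>(run_count, batch_size - num_values);
}
```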


Line 1612:         DCHECK(result);
> Could invalid input data trip this DCHECK?
I think so. Removed DCHECK and added error check.


Line 1620:     while (num_values < batch_size) {
> The termination condition would be more obvious if this was a for() loop. I
Done


Line 1658:     def_level_ = -1;
> Should we use a similar constant for def_level_ -1. E.g. INVALID_DEF_LEVEL?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2843
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21fa9b050a45f2dd45cc0091ea5b008d3c0a3f30
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes
