[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Matthew Jacobs (Code Review) Wed, 14 Sep 2016 12:59:31 -0700

Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................



Patch Set 4:

(15 comments)

I was hoping to get to +1 this pass but there's a bit more still...

Most importantly there's a bug in computing max and I think we need unit tests.

http://gerrit.cloudera.org:8080/#/c/4371/4/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

PS4, Line 196: RuntimeProfile::Counter 
single_footer_process_timer(TUnit::TIME_NS);
this is unnecessary, SCOPED_TIMER below just creates a ScopedTimer which 
creates a MonotonicStopWatch and adds the result of it to this counter, so you 
might as well just use MonotonicStopWatch directly.


http://gerrit.cloudera.org:8080/#/c/4371/4/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

PS4, Line 49: ADD_MIN_MAX_AVG_VALUE_TIMER
please rename


PS4, Line 70: ADD_MIN_MAX_AVG_VALUE_TIMER
same


PS4, Line 147: .
can you add:
This is used to aggregate child counters.


PS4, Line 204: /// Contrary to the AveragedCounter, this only keeps a track of 
values and not 'Counter'
             : /// objects themselves.
this is a bit misleading because it doesn't really keep values, it keeps the 
statistics about the values.

How about:
Contrary to the AveragedCoutner, this keeps statistics about raw values whereas 
AveragedCounter is used for averaging child counters.


PS4, Line 249:  /// Current sums of values seen so far and min/max values seen 
so far.
             :   int64_t current_min_;
             :   int64_t current_max_;
             :   int64_t current_sum_;
current is unnecessary


http://gerrit.cloudera.org:8080/#/c/4371/4/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

PS4, Line 301:       const TSummaryStatsCounter& c = 
node.summary_stats_counters[i];
             :       SummaryStatsCounterMap::iterator it = 
summary_stats_map_.find(c.name);
             :       if (it == summary_stats_map_.end()) {
             :         summary_stats_map_[c.name] =
             :             pool_->Add(new SummaryStatsCounter(
             :                 c.unit, c.total_num_values, c.min_value, 
c.max_value, c.sum));
             :       } else {
             :         it->second->SetStats(c);
             :       }
it's hard to know if this gets exercised in your test case.

Can you add a test in runtime-profile-test.cc that sets up a stats counter and 
goes through a merge?


PS4, Line 667:  
super nit: extra space


Line 668:     for (const SummaryStatsCounterMap::value_type& v: 
summary_stats_map_) {
I think we need a separate case to handle when total_num_values is 0 because 
avg/min/max don't make sense on 0 samples, plus, min and max will print out 
INT_MAX/INT_MIN which is an implementation detail.

I think this would benefit from a test case in runtime-profile-test.cc as well


PS4, Line 669: (
this brace isn't closed at the end


PS4, Line 676: v.second->unit()
this isn't in the same Counter's units (which is time for the use case you 
added), this is just a raw count. I think you should just print this 
total_num_values without pretty printer.


PS4, Line 1017:  else if
this isn't an else if, you need to execute both branches

this will be wrong when there is just 1 sample.


http://gerrit.cloudera.org:8080/#/c/4371/4/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

PS4, Line 336: MinMaxAvgValueCounter
update


PS4, Line 337: averages the time taken to
             :     # scan and process the Parquet footer across scan ranges and 
also records the min and
             :     # the max time.
keeps statistics for the time taken to scan and Process the Parquet footer 
across scan ranges


PS4, Line 340:     if ranges_per_node > 1:
             :       num_equal = 0
             :       for min_max_time in footer_processing_time_list:
             :         num_equal += (min_max_time[0] == min_max_time[1])
             :       # The chance that the times taken by different scan ranges 
to process the footer,
             :       # are equal, is very unlikely. However, it could happen 
which could make the min and
             :       # max time to process a footer equal. Assert that at a 
time, no more than 1 node has
             :       # equal min and max time.
             :       assert num_equal < 2
following up with my previous comment, I still think this is weird and was 
hiding at least 1 bug (the max wasn't set when there was only 1 sample). I'm 
not really sure what this verification gets us.

It's fine if you can't compare it to the total time, but please at least check 
that if ranges_per_node is 1 or you see that the number of samples is 1, then 
min = max = avg != 0. That's easy since you don't have to do any string parsing.

That said, I do think we should have some tests that do some better validation, 
e.g. unit tests in runtime-profile-test.cc. Then we don't need to worry about 
parsing times here and we get some more clear coverage.


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Reply via email to