Matthew Jacobs has posted comments on this change. Change subject: IMPALA-3823: Add timer to measure Parquet footer reads ......................................................................
Patch Set 4: (15 comments) I was hoping to get to +1 this pass but there's a bit more still... Most importantly there's a bug in computing max and I think we need unit tests. http://gerrit.cloudera.org:8080/#/c/4371/4/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: PS4, Line 196: RuntimeProfile::Counter single_footer_process_timer(TUnit::TIME_NS); this is unnecessary, SCOPED_TIMER below just creates a ScopedTimer which creates a MonotonicStopWatch and adds the result of it to this counter, so you might as well just use MonotonicStopWatch directly. http://gerrit.cloudera.org:8080/#/c/4371/4/be/src/util/runtime-profile-counters.h File be/src/util/runtime-profile-counters.h: PS4, Line 49: ADD_MIN_MAX_AVG_VALUE_TIMER please rename PS4, Line 70: ADD_MIN_MAX_AVG_VALUE_TIMER same PS4, Line 147: . can you add: This is used to aggregate child counters. PS4, Line 204: /// Contrary to the AveragedCounter, this only keeps a track of values and not 'Counter' : /// objects themselves. this is a bit misleading because it doesn't really keep values, it keeps the statistics about the values. How about: Contrary to the AveragedCoutner, this keeps statistics about raw values whereas AveragedCounter is used for averaging child counters. PS4, Line 249: /// Current sums of values seen so far and min/max values seen so far. : int64_t current_min_; : int64_t current_max_; : int64_t current_sum_; current is unnecessary http://gerrit.cloudera.org:8080/#/c/4371/4/be/src/util/runtime-profile.cc File be/src/util/runtime-profile.cc: PS4, Line 301: const TSummaryStatsCounter& c = node.summary_stats_counters[i]; : SummaryStatsCounterMap::iterator it = summary_stats_map_.find(c.name); : if (it == summary_stats_map_.end()) { : summary_stats_map_[c.name] = : pool_->Add(new SummaryStatsCounter( : c.unit, c.total_num_values, c.min_value, c.max_value, c.sum)); : } else { : it->second->SetStats(c); : } it's hard to know if this gets exercised in your test case. Can you add a test in runtime-profile-test.cc that sets up a stats counter and goes through a merge? PS4, Line 667: super nit: extra space Line 668: for (const SummaryStatsCounterMap::value_type& v: summary_stats_map_) { I think we need a separate case to handle when total_num_values is 0 because avg/min/max don't make sense on 0 samples, plus, min and max will print out INT_MAX/INT_MIN which is an implementation detail. I think this would benefit from a test case in runtime-profile-test.cc as well PS4, Line 669: ( this brace isn't closed at the end PS4, Line 676: v.second->unit() this isn't in the same Counter's units (which is time for the use case you added), this is just a raw count. I think you should just print this total_num_values without pretty printer. PS4, Line 1017: else if this isn't an else if, you need to execute both branches this will be wrong when there is just 1 sample. http://gerrit.cloudera.org:8080/#/c/4371/4/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: PS4, Line 336: MinMaxAvgValueCounter update PS4, Line 337: averages the time taken to : # scan and process the Parquet footer across scan ranges and also records the min and : # the max time. keeps statistics for the time taken to scan and Process the Parquet footer across scan ranges PS4, Line 340: if ranges_per_node > 1: : num_equal = 0 : for min_max_time in footer_processing_time_list: : num_equal += (min_max_time[0] == min_max_time[1]) : # The chance that the times taken by different scan ranges to process the footer, : # are equal, is very unlikely. However, it could happen which could make the min and : # max time to process a footer equal. Assert that at a time, no more than 1 node has : # equal min and max time. : assert num_equal < 2 following up with my previous comment, I still think this is weird and was hiding at least 1 bug (the max wasn't set when there was only 1 sample). I'm not really sure what this verification gets us. It's fine if you can't compare it to the total time, but please at least check that if ranges_per_node is 1 or you see that the number of samples is 1, then min = max = avg != 0. That's easy since you don't have to do any string parsing. That said, I do think we should have some tests that do some better validation, e.g. unit tests in runtime-profile-test.cc. Then we don't need to worry about parsing times here and we get some more clear coverage. -- To view, visit http://gerrit.cloudera.org:8080/4371 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sailesh Mukil <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Matthew Jacobs <[email protected]> Gerrit-Reviewer: Sailesh Mukil <[email protected]> Gerrit-HasComments: Yes
