[kudu-CR] [maintenance] use workload statistics to scale perf improvement

Andrew Wong (Code Review) Thu, 09 Jul 2020 15:19:32 -0700

Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15995 )


Change subject: [maintenance] use workload statistics to scale perf_improvement
......................................................................


Patch Set 6:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/15995/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15995/5//COMMIT_MSG@19
PS5, Line 19: dynamically adjust priorities of compaction/flush ops for 
different tables.
Thanks for running these workloads! They do seem to be using a pretty small 
amount of data though. Scans on the order of a few thousand microseconds seem 
like they would have pretty high variance. At best, I think this just tells us 
there isn't a regression for small datasets.

I think the real value of this patch would show in a more realistic scenario 
where there are a couple of tables, one of which is cold, the other is hot, but 
the cold table is in need of compaction more, so Kudu prioritizes the cold 
table. In order for compactions to make the difference, I think we'd need to be 
working with a larger dataset, with tables large enough to have many 
uncompacted rowsets.


http://gerrit.cloudera.org:8080/#/c/15995/5/src/kudu/tablet/tablet.h
File src/kudu/tablet/tablet.h:

http://gerrit.cloudera.org:8080/#/c/15995/5/src/kudu/tablet/tablet.h@496
PS5, Line 496: millisencons
nit: "milliseconds", same elsewhere


http://gerrit.cloudera.org:8080/#/c/15995/6/src/kudu/tablet/tablet.cc
File src/kudu/tablet/tablet.cc:

http://gerrit.cloudera.org:8080/#/c/15995/6/src/kudu/tablet/tablet.cc@2065
PS6, Line 2065: FLAGS_collect_workload_stats_min_interval_ms
This and --collect_workload_stats_max_interval_ms don't appear to be measuring 
the same time: this one indicates how frequently we'll compute the rate, 
whereas --collect_workload_stats_max_interval_ms indicates how frequently we'll 
snapshot the metrics.

Maybe instead call them --workload_stats_rate_collection_interval and 
--workload_stats_metric_collection_interval?

Maybe we have these be in the same block? I.e. could we compute rate whenever 
we collect the metrics, and rely on a single flag? Assuming a roughly constant 
rate of calling CollectAndUpdateWorkloadStats(), do you think that would give 
us a more consistent estimate of workload over time? My concern is that the 
quality of the metric varies over time, so we might not be making the best 
decision at all times.


http://gerrit.cloudera.org:8080/#/c/15995/5/src/kudu/tablet/tablet_mm_ops.cc
File src/kudu/tablet/tablet_mm_ops.cc:

http://gerrit.cloudera.org:8080/#/c/15995/5/src/kudu/tablet/tablet_mm_ops.cc@116
PS5, Line 116:   *workload_score = (read_rate + write_rate) > 1 ? 1.0 : 
(read_rate + write_rate);
nit: this is simpler as

 std::min(1, read_rate + write_rate)

Same in TabletReplicaOpBase


http://gerrit.cloudera.org:8080/#/c/15995/6/src/kudu/tablet/tablet_mm_ops.cc
File src/kudu/tablet/tablet_mm_ops.cc:

http://gerrit.cloudera.org:8080/#/c/15995/6/src/kudu/tablet/tablet_mm_ops.cc@114
PS6, Line 114:   double write_rate = 0;
Should we scale the write rate based on what we should consider "hot"? 
Otherwise, if we have a write rate of over 1 rps, the tablet is considered hot, 
which doesn't seem all that useful.

Same question for scans. Maybe we should have a flag for what's considered hot, 
for reads and writes? That could also be useful in defining the upper bound 
means with respect to the workload score (e.g. if the perf score is 1, the 
write rate was over 1000 rps).


http://gerrit.cloudera.org:8080/#/c/15995/6/src/kudu/util/maintenance_manager.cc
File src/kudu/util/maintenance_manager.cc:

http://gerrit.cloudera.org:8080/#/c/15995/6/src/kudu/util/maintenance_manager.cc@412
PS6, Line 412:     op->UpdateWorkloadScore(&workload_score);
Can we also predicate this on a runtime gflag? I'd like to be a bit careful 
with these kinds of heuristic changes, since it may have performance 
regressions that we haven't noticed yet.


http://gerrit.cloudera.org:8080/#/c/15995/6/src/kudu/util/maintenance_manager.cc@481
PS6, Line 481: perf_improvement
Shouldn't this be priority?



--
To view, visit http://gerrit.cloudera.org:8080/15995
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie3afcc359002d1392164ba2fda885f8930ef8696
Gerrit-Change-Number: 15995
Gerrit-PatchSet: 6
Gerrit-Owner: Yifan Zhang <chinazhangyi...@163.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <chinazhangyi...@163.com>
Gerrit-Comment-Date: Thu, 09 Jul 2020 22:18:13 +0000
Gerrit-HasComments: Yes

[kudu-CR] [maintenance] use workload statistics to scale perf improvement

Reply via email to