Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15995 )
Change subject: [maintenance] use workload statistics to scale perf_improvement ...................................................................... Patch Set 6: (7 comments) http://gerrit.cloudera.org:8080/#/c/15995/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15995/5//COMMIT_MSG@19 PS5, Line 19: dynamically adjust priorities of compaction/flush ops for different tables. Thanks for running these workloads! They do seem to be using a pretty small amount of data though. Scans on the order of a few thousand microseconds seem like they would have pretty high variance. At best, I think this just tells us there isn't a regression for small datasets. I think the real value of this patch would show in a more realistic scenario where there are a couple of tables, one of which is cold, the other is hot, but the cold table is in need of compaction more, so Kudu prioritizes the cold table. In order for compactions to make the difference, I think we'd need to be working with a larger dataset, with tables large enough to have many uncompacted rowsets. http://gerrit.cloudera.org:8080/#/c/15995/5/src/kudu/tablet/tablet.h File src/kudu/tablet/tablet.h: http://gerrit.cloudera.org:8080/#/c/15995/5/src/kudu/tablet/tablet.h@496 PS5, Line 496: millisencons nit: "milliseconds", same elsewhere http://gerrit.cloudera.org:8080/#/c/15995/6/src/kudu/tablet/tablet.cc File src/kudu/tablet/tablet.cc: http://gerrit.cloudera.org:8080/#/c/15995/6/src/kudu/tablet/tablet.cc@2065 PS6, Line 2065: FLAGS_collect_workload_stats_min_interval_ms This and --collect_workload_stats_max_interval_ms don't appear to be measuring the same time: this one indicates how frequently we'll compute the rate, whereas --collect_workload_stats_max_interval_ms indicates how frequently we'll snapshot the metrics. Maybe instead call them --workload_stats_rate_collection_interval and --workload_stats_metric_collection_interval? Maybe we have these be in the same block? I.e. could we compute rate whenever we collect the metrics, and rely on a single flag? Assuming a roughly constant rate of calling CollectAndUpdateWorkloadStats(), do you think that would give us a more consistent estimate of workload over time? My concern is that the quality of the metric varies over time, so we might not be making the best decision at all times. http://gerrit.cloudera.org:8080/#/c/15995/5/src/kudu/tablet/tablet_mm_ops.cc File src/kudu/tablet/tablet_mm_ops.cc: http://gerrit.cloudera.org:8080/#/c/15995/5/src/kudu/tablet/tablet_mm_ops.cc@116 PS5, Line 116: *workload_score = (read_rate + write_rate) > 1 ? 1.0 : (read_rate + write_rate); nit: this is simpler as std::min(1, read_rate + write_rate) Same in TabletReplicaOpBase http://gerrit.cloudera.org:8080/#/c/15995/6/src/kudu/tablet/tablet_mm_ops.cc File src/kudu/tablet/tablet_mm_ops.cc: http://gerrit.cloudera.org:8080/#/c/15995/6/src/kudu/tablet/tablet_mm_ops.cc@114 PS6, Line 114: double write_rate = 0; Should we scale the write rate based on what we should consider "hot"? Otherwise, if we have a write rate of over 1 rps, the tablet is considered hot, which doesn't seem all that useful. Same question for scans. Maybe we should have a flag for what's considered hot, for reads and writes? That could also be useful in defining the upper bound means with respect to the workload score (e.g. if the perf score is 1, the write rate was over 1000 rps). http://gerrit.cloudera.org:8080/#/c/15995/6/src/kudu/util/maintenance_manager.cc File src/kudu/util/maintenance_manager.cc: http://gerrit.cloudera.org:8080/#/c/15995/6/src/kudu/util/maintenance_manager.cc@412 PS6, Line 412: op->UpdateWorkloadScore(&workload_score); Can we also predicate this on a runtime gflag? I'd like to be a bit careful with these kinds of heuristic changes, since it may have performance regressions that we haven't noticed yet. http://gerrit.cloudera.org:8080/#/c/15995/6/src/kudu/util/maintenance_manager.cc@481 PS6, Line 481: perf_improvement Shouldn't this be priority? -- To view, visit http://gerrit.cloudera.org:8080/15995 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie3afcc359002d1392164ba2fda885f8930ef8696 Gerrit-Change-Number: 15995 Gerrit-PatchSet: 6 Gerrit-Owner: Yifan Zhang <chinazhangyi...@163.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Yifan Zhang <chinazhangyi...@163.com> Gerrit-Comment-Date: Thu, 09 Jul 2020 22:18:13 +0000 Gerrit-HasComments: Yes