Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19931 )
Change subject: KUDU-3476: Make replica placement range and table aware ...................................................................... Patch Set 11: (18 comments) http://gerrit.cloudera.org:8080/#/c/19931/10//COMMIT_MSG Commit Message: PS10: > Please also add the link to the design document. Done http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/catalog_manager.cc@231 PS10, Line 231: "between runs."); > Why is this change? Done http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/catalog_manager.cc@360 PS10, Line 360: "of 0 means table locations are not be cached."); > Why is this change? Done http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/catalog_manager.cc@364 PS10, Line 364: "Whether to support range-specific hash schemas for tables."); > Why is this change? Done http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/catalog_manager.cc@376 PS10, Line 376: use ran > use Done http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/catalog_manager.cc@393 PS10, Line 393: "Set the ratio of how much write limit can be reached."); > Why is this change? Done http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/catalog_manager.cc@418 PS10, Line 418: "per range."); > Why is this change? Done http://gerrit.cloudera.org:8080/#/c/19931/6/src/kudu/master/master.proto File src/kudu/master/master.proto: http://gerrit.cloudera.org:8080/#/c/19931/6/src/kudu/master/master.proto@377 PS6, Line 377: optional int64 latest_tsk_seq_num = 6; : : // Replica management parameters that the tablet server is running with. : // This field is set only if the registration field is present. : optional consensus.ReplicaManagementInfoPB replica_management_info = 7; : : // The number of tablets that are BOOTSTRAPPING or RUNNING in each dimension. : // Used by the master to determine load when placing new tablet replicas : // based on dimension. > ? I changed the protobuf representation after this to include the table each range belongs to in case of ranges with same start keys across different tables. http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/master.proto File src/kudu/master/master.proto: http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/master.proto@338 PS10, Line 338: // The number of tablets that are BOOTSTRAPPING or RUNNING per range. > Would be great to add comments for these two new structures as well. Done http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/master.proto@345 PS10, Line 345: RangePerTablePB { > Can this be deduced out of information from the field above? Yes it can be, I considered that approach initially but I thought this would be more clear. I'll go ahead and make the change, the replicas per table would no longer be stored in other areas of the code as well as it will also be deduced from tablets per range. http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/master.proto@387 PS10, Line 387: > What's about other states (except, say, FAILED)? I could think of a situat Addressed this in your other comment about this, thanks for bringing this up. http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/master_runner.cc File src/kudu/master/master_runner.cc: http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/master_runner.cc@350 PS10, Line 350: // Master always reads the latest data snapshot from the system catalog and : // never uses any specific past timestamp for a read snapshot. With that, : // it doesn't make sense to keep a long chain of UNDO deltas in addition to the : // latest version in the MVCC. Keeping a short history of deltas frees CPU > I've mentioned that multiple times already, but I see you are continuing do Understood, I can see that it makes it harder to review the patch. I'll keep unrelated changes in a different patch from now on. http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/placement_policy.cc File src/kudu/master/placement_policy.cc: http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/placement_policy.cc@152 PS10, Line 152: CHECK(!replicas_by_range.empty()); > nit: I'd change the criterion to use vector::empty() -- it's usually implem Done http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/placement_policy.cc@159 PS10, Line 159: CHECK(!replicas_by_table.empty()); > ditto Done http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/placement_policy.cc@166 PS10, Line 166: CHECK(!replicas_total.empty()); > ditto Done http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/ts_descriptor.cc File src/kudu/master/ts_descriptor.cc: http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/ts_descriptor.cc@208 PS10, Line 208: recent_replicas_by_range_[table_id].first.find(range_start_key) == : recent_replicas_by_range_[table_id].first.end() || : recent_replicas_by_range_[table_id].first[range_start_key] == 0) { > nit: wrong indent; also, use braces here for the scope since even the crite Done http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/ts_descriptor.cc@214 PS10, Line 214: const double kHalflifeSecs = FLAGS_tserver_last_replica_creations_halflife_ms / 1000; : const MonoTime now = MonoTime::Now(); : // If map for the table or range hasn't been initialized yet, use init_time_ as last decay. : MonoTime last_decay = : last_replica_decay_by_range_.find(table_id) == last_replica_decay_by_range_.end() || : last_replica_decay_by_range_[table_id].first.find(range_start_key) == : last_replica_decay_by_range_[table_id].first.end() ? : init_time_ : last_replica_decay_by_range_[table_id].first[range_start_key]; : double secs_since_last_decay = (now - last_decay).To > Why do we need any recency information in case of range-aware replica place When placing replicas, we currently use two stats: The number of live replicas and the number of recently placed replicas. The number of live replicas is the number of replicas before the placement of replicas begins. If we don't have the information of the recently placed replicas, then during the current iteration of placing replicas we'll only be using the number of replicas placed on the tserver before the placement begins. Example: TabletNumByRangeMap range({ {"a1", 500} }); const vector<LocationInfo> cluster_info = { { "", { { "ts0", 0 }, { "ts1", 0 }, { "ts2", 500, { }, { {"t2", range }, } }, { "ts3", 500, { }, { {"t1", range }, } }, } }, }; This is from PlacementPolicyTest.PlaceTabletReplicasWithRangesAndTables. Four tablet servers exist, two with no replicas, one with 500 replicas of range "a1" of table "t1", and one with 500 replicas of range "a1" of table "t2". If we place 3000 replicas of range "a1" of table "t1", without taking into account the recently placed replicas, the algorithm would place 1000 replicas on each of tablet server "ts0", "ts1", "ts2". This is because it would only be factoring in the number of live replicas and not the number of pending replicas, so it would see 500 replicas of range "a1" of table "t1" on tablet server "ts3" and none on the others throughout the process. If we place 3000 replicas of range "a1" of table "t1" but this time we take into account the recently placed replicas, the algorithm will recognize the recently placed replicas and place around 870 replicas on tablet servers "ts0, "ts1, and "ts2", and only 370 replicas on "ts2". This is because while placing replicas, the algorithm recognizes the recently placed replicas, so after placing 500 on the three tablet servers without any initial load, it will then distribute the remaining replicas amongst the four tablet servers evenly. I'm still unsure if we need the decay information. Without it, the recentReplicaCreation values wouldn't be reset as it decays the values back to 0. I can update the commit message to include some information about using recently created replicas. http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/tserver/ts_tablet_manager.h File src/kudu/tserver/ts_tablet_manager.h: http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/tserver/ts_tablet_manager.h@222 PS10, Line 222: umLiveTabletsByRangePerT > What about tablet replicas in, say, INITIALIZED state, or any other state t Good point, I used the GetNumLiveTablets() method as inspiration here. That method uses only RUNNING or BOOTSTRAPPING tablets. I'll defer to you here as I'm not too familiar with the different tablet states. Which other ones do you think should be included besides INITIALIZED? -- To view, visit http://gerrit.cloudera.org:8080/19931 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9caeb8d5547e946bfeb152a99e1ec034c3fa0a0f Gerrit-Change-Number: 19931 Gerrit-PatchSet: 11 Gerrit-Owner: Mahesh Reddy <[email protected]> Gerrit-Reviewer: Abhishek Chennaka <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Ashwani Raina <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mahesh Reddy <[email protected]> Gerrit-Reviewer: Marton Greber <[email protected]> Gerrit-Reviewer: Zoltan Chovan <[email protected]> Gerrit-Reviewer: Ádám Bakai <[email protected]> Gerrit-Comment-Date: Thu, 29 Jun 2023 00:18:07 +0000 Gerrit-HasComments: Yes
