Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19931 )

Change subject: KUDU-3476: Make replica placement range and table aware
......................................................................


Patch Set 11:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/19931/10//COMMIT_MSG
Commit Message:

PS10:
> Please also add the link to the design document.
Done


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/catalog_manager.cc@231
PS10, Line 231:              "between runs.");
> Why is this change?
Done


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/catalog_manager.cc@360
PS10, Line 360:               "of 0 means table locations are not be cached.");
> Why is this change?
Done


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/catalog_manager.cc@364
PS10, Line 364:             "Whether to support range-specific hash schemas for 
tables.");
> Why is this change?
Done


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/catalog_manager.cc@376
PS10, Line 376: use ran
> use
Done


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/catalog_manager.cc@393
PS10, Line 393:               "Set the ratio of how much write limit can be 
reached.");
> Why is this change?
Done


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/catalog_manager.cc@418
PS10, Line 418:             "per range.");
> Why is this change?
Done


http://gerrit.cloudera.org:8080/#/c/19931/6/src/kudu/master/master.proto
File src/kudu/master/master.proto:

http://gerrit.cloudera.org:8080/#/c/19931/6/src/kudu/master/master.proto@377
PS6, Line 377:   optional int64 latest_tsk_seq_num = 6;
             :
             :   // Replica management parameters that the tablet server is 
running with.
             :   // This field is set only if the registration field is present.
             :   optional consensus.ReplicaManagementInfoPB 
replica_management_info = 7;
             :
             :   // The number of tablets that are BOOTSTRAPPING or RUNNING in 
each dimension.
             :   // Used by the master to determine load when placing new 
tablet replicas
             :   // based on dimension.
> ?
I changed the protobuf representation after this comment to also include the 
table each range belongs to, since ranges in different tables can share the 
same start keys.
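For reference, a sketch of what the reworked representation might look like 
(the message name, field names, and field numbers here are illustrative, not 
necessarily what the patch uses):

```proto
// Illustrative only: tablet counts keyed by both the range start key and
// the table the range belongs to, so that ranges with identical start
// keys in different tables don't collide.
message TabletNumByRangePB {
  optional bytes range_start_key = 1;
  optional string table_id = 2;
  optional int32 num_tablets = 3;
}
```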


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/master.proto
File src/kudu/master/master.proto:

http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/master.proto@338
PS10, Line 338: // The number of tablets that are BOOTSTRAPPING or RUNNING per 
range.
> Would be great to add comments for these two new structures as well.
Done


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/master.proto@345
PS10, Line 345: RangePerTablePB {
> Can this be deduced out of information from the field above?
Yes, it can be. I considered that approach initially but thought this would be 
clearer. I'll go ahead and make the change; the replicas per table will no 
longer be stored in other areas of the code either, since they can also be 
deduced from the tablets per range.


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/master.proto@387
PS10, Line 387:
> What's about other states (except, say, FAILED)?  I could think of a situat
Addressed this in your other comment about this, thanks for bringing this up.


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/master_runner.cc
File src/kudu/master/master_runner.cc:

http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/master_runner.cc@350
PS10, Line 350:  // Master always reads the latest data snapshot from the 
system catalog and
              :   // never uses any specific past timestamp for a read 
snapshot. With that,
              :   // it doesn't make sense to keep a long chain of UNDO deltas 
in addition to the
              :   // latest version in the MVCC. Keeping a short history of 
deltas frees CPU
> I've mentioned that multiple times already, but I see you are continuing do
Understood, I can see that it makes it harder to review the patch. I'll keep 
unrelated changes in a different patch from now on.


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/placement_policy.cc
File src/kudu/master/placement_policy.cc:

http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/placement_policy.cc@152
PS10, Line 152: CHECK(!replicas_by_range.empty());
> nit: I'd change the criterion to use vector::empty() -- it's usually implem
Done


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/placement_policy.cc@159
PS10, Line 159: CHECK(!replicas_by_table.empty());
> ditto
Done


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/placement_policy.cc@166
PS10, Line 166: CHECK(!replicas_total.empty());
> ditto
Done


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/ts_descriptor.cc
File src/kudu/master/ts_descriptor.cc:

http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/ts_descriptor.cc@208
PS10, Line 208:       
recent_replicas_by_range_[table_id].first.find(range_start_key) ==
              :       recent_replicas_by_range_[table_id].first.end() ||
              :       
recent_replicas_by_range_[table_id].first[range_start_key] == 0) {
> nit: wrong indent; also, use braces here for the scope since even the crite
Done


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/master/ts_descriptor.cc@214
PS10, Line 214:   const double kHalflifeSecs = 
FLAGS_tserver_last_replica_creations_halflife_ms / 1000;
              :   const MonoTime now = MonoTime::Now();
              :   // If map for the table or range hasn't been initialized yet, 
use init_time_ as last decay.
              :   MonoTime last_decay =
              :       last_replica_decay_by_range_.find(table_id) == 
last_replica_decay_by_range_.end() ||
              :       
last_replica_decay_by_range_[table_id].first.find(range_start_key) ==
              :       last_replica_decay_by_range_[table_id].first.end() ?
              :       init_time_ : 
last_replica_decay_by_range_[table_id].first[range_start_key];
              :   double secs_since_last_decay = (now - last_decay).To
> Why do we need any recency information in case of range-aware replica place
When placing replicas, we currently use two stats: the number of live replicas 
and the number of recently placed replicas. The number of live replicas is a 
snapshot taken before placement begins, so without the recently placed count, 
every decision during the current placement run would only see the load that 
existed on each tserver before the run started.

Example:

TabletNumByRangeMap range({ {"a1", 500} });

const vector<LocationInfo> cluster_info = {
    {
        "",
        {
            { "ts0", 0 },
            { "ts1", 0 },
            { "ts2", 500, { }, { {"t2", range}, } },
            { "ts3", 500, { }, { {"t1", range}, } },
        }
    },
};

This is from PlacementPolicyTest.PlaceTabletReplicasWithRangesAndTables. Four 
tablet servers exist, two with no replicas, one with 500 replicas of range "a1" 
of table "t1", and one with 500 replicas of range "a1" of table "t2".

If we place 3000 replicas of range "a1" of table "t1" without taking into 
account the recently placed replicas, the algorithm would place 1000 replicas 
on each of the tablet servers "ts0", "ts1", and "ts2". This is because it 
would only factor in the number of live replicas and not the number of pending 
replicas, so throughout the process it would see 500 replicas of range "a1" of 
table "t1" on tablet server "ts3" and none on the others.

If we place 3000 replicas of range "a1" of table "t1" but this time take into 
account the recently placed replicas, the algorithm will recognize them and 
place around 870 replicas on each of "ts0", "ts1", and "ts2", and only around 
370 on "ts3". This is because while placing replicas, the algorithm sees the 
recently placed replicas, so after placing 500 on each of the three tablet 
servers without any initial load, it distributes the remaining replicas evenly 
amongst all four tablet servers.

I'm still unsure whether we need the decay information. Without it, the 
recentReplicaCreation values would never be reset, since decay is what brings 
them back to 0 over time.

I can update the commit message to include some information about using 
recently created replicas.


http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

http://gerrit.cloudera.org:8080/#/c/19931/10/src/kudu/tserver/ts_tablet_manager.h@222
PS10, Line 222: umLiveTabletsByRangePerT
> What about tablet replicas in, say, INITIALIZED state, or any other state t
Good point, I used the GetNumLiveTablets() method as inspiration here; it 
counts only RUNNING and BOOTSTRAPPING tablets. I'll defer to you here since 
I'm not too familiar with the different tablet states. Which other ones do you 
think should be included besides INITIALIZED?



--
To view, visit http://gerrit.cloudera.org:8080/19931
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9caeb8d5547e946bfeb152a99e1ec034c3fa0a0f
Gerrit-Change-Number: 19931
Gerrit-PatchSet: 11
Gerrit-Owner: Mahesh Reddy <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <[email protected]>
Gerrit-Reviewer: Marton Greber <[email protected]>
Gerrit-Reviewer: Zoltan Chovan <[email protected]>
Gerrit-Reviewer: Ádám Bakai <[email protected]>
Gerrit-Comment-Date: Thu, 29 Jun 2023 00:18:07 +0000
Gerrit-HasComments: Yes
