Mahesh Reddy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16859 )

Change subject: [master] Range specific hashing at table creation time.
......................................................................


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/16859/1/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16859/1/src/kudu/common/partition.cc@451
PS1, Line 451:     // The number of ranges before being split should match the 
size of range_hash_schemas.
> I guess I need to understand what are split rows to better understand what'
added a comment in the second patch, lmk if you think it needs further 
clarification.


http://gerrit.cloudera.org:8080/#/c/16859/2/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16859/2/src/kudu/common/partition.cc@418
PS2, Line 418:   if (!split_rows.empty()) {
             :     for (const auto& hash_schemas : range_hash_schemas) {
             :       if (!hash_schemas.empty()) {
             :         return Status::InvalidArgument("Both 'split_rows' and 
'range_hash_schemas' cannot be "
             :                                        "populated at the same 
time.");
             :       }
             :     }
> Should we be checking here that ranger_hash_schemas is empty if split_rows
Ultimately, if 'range_hash_schema' is a vector filled with empty vectors it 
essentially serves the same purpose as 'range_hash_schema' being empty. At that 
point, the table wide hash schema will be applied to each new range. 
'range_hash_schema' doesn't add any new functionality unless one of its vectors 
is defined and that is where the incompatibility with 'split_rows' comes in.

Now, the reason why I need 'range_hash_schema' to remain the same way is due to 
the ordering of the bounds and their schemas being passed in. If I omit an 
entry in 'range_hash_schema' due to its respective bound not having a user 
defined schema while other bounds have their respective schemas defined, then I 
won't have a way to match the schemas to the bound. A way around this is to 
modify the current 'range_bounds' parameter to 'vector<tuple<KuduPartialRow, 
KuduPartialRow,vector<HashBucketSchema>>>'. This way, the bounds are matched to 
their respective hash schemas. If you think this is more clear (or using a 
struct instead of a tuple), I can consider making that change.


http://gerrit.cloudera.org:8080/#/c/16859/1/src/kudu/integration-tests/table_locations-itest.cc
File src/kudu/integration-tests/table_locations-itest.cc:

http://gerrit.cloudera.org:8080/#/c/16859/1/src/kudu/integration-tests/table_locations-itest.cc@474
PS1, Line 474: EQ(14,
> nit: can just use ASSERT_FALSE
Done


http://gerrit.cloudera.org:8080/#/c/16859/2/src/kudu/integration-tests/table_locations-itest.cc
File src/kudu/integration-tests/table_locations-itest.cc:

http://gerrit.cloudera.org:8080/#/c/16859/2/src/kudu/integration-tests/table_locations-itest.cc@442
PS2, Line 442: TEST_F(TableLocationsTest, TestRangeSpecificHashing) {
> We chatted about this over a call, but for posterity, it'd be nice to test
will add this in next iteration.


http://gerrit.cloudera.org:8080/#/c/16859/2/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/16859/2/src/kudu/master/catalog_manager.cc@1714
PS2, Line 1714:         PartitionSchema::HashBucketSchemas hash_bucket_schemas;
              :         if (j < req.range_hash_schemas_size()) {
              :           const auto& hash_schemas_pb = 
req.range_hash_schemas(j++);
              :           if (!hash_schemas_pb.hash_schemas().empty()) {
              :             RETURN_NOT_OK(
              :                 
PartitionSchema::ExtractHashBucketSchemasFromPB(schema,
              :                                                                 
hash_schemas_pb.hash_schemas(),
              :                                                                 
&hash_bucket_schemas));
              :           }
              :         }
> IMO the intermixed parsing of 'range_hash_schemas' and 'ops' makes this pre
good catch, will move out 'range_hash_schema' logic out of for loop for 'ops'.


http://gerrit.cloudera.org:8080/#/c/16859/2/src/kudu/master/master.proto
File src/kudu/master/master.proto:

http://gerrit.cloudera.org:8080/#/c/16859/2/src/kudu/master/master.proto@504
PS2, Line 504:   optional PartitionSchemaPB partition_schema = 7;
> nit: can you also comment how this is used if 'range_hash_schemas' is speci
Done


http://gerrit.cloudera.org:8080/#/c/16859/2/src/kudu/master/master.proto@505
PS2, Line 505:   repeated PerRangeHashBucketSchemasPB range_hash_schemas = 12;
> nit: add docs to how this should be populated, its ordering constraints wit
Done



--
To view, visit http://gerrit.cloudera.org:8080/16859
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8f0dcbc3324f8f2d6e99b4d169fdf5c7f7dff95d
Gerrit-Change-Number: 16859
Gerrit-PatchSet: 2
Gerrit-Owner: Mahesh Reddy <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Bankim Bhavsar <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <[email protected]>
Gerrit-Comment-Date: Tue, 15 Dec 2020 00:11:54 +0000
Gerrit-HasComments: Yes

Reply via email to