[kudu-CR] [pruning] wip KUDU-2671: Modifies pruning logic to include per range hash schemas.

Mahesh Reddy (Code Review) Wed, 14 Jul 2021 22:46:24 -0700

Mahesh Reddy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17643 )


Change subject: [pruning] wip KUDU-2671: Modifies pruning logic to include per 
range hash schemas.
......................................................................


Patch Set 2:

(33 comments)

http://gerrit.cloudera.org:8080/#/c/17643/1//COMMIT_MSG
Commit Message:

PS1: 
> nit: it would be nice to reformat this commit description to follow the gui
Took a quick look, but what specifically did you think needs to be changed 
about the commit message's format? I made the lengths of the lines shorter.


http://gerrit.cloudera.org:8080/#/c/17643/1//COMMIT_MSG@11
PS1, Line 11:  Currently, when the ranges are constru
> nit: same table-wide set of hash bucket schemas is used for every range
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition.h@285
PS1, Line 285: lhs_hb_sch
> nit: if the other parameter named 'rhs_hb_schemas', maybe name this one as
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition.h@286
PS1, Line 286:
> nit: spacing
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition.cc@1108
PS1, Line 1108:
              : bool PartitionSchema::EqualHashBucketSchemas(const 
HashBucketSchemas& hb_schemas,
              :                                              const 
HashBucketSchemas& rhs_hb_schemas) {
> +1
It's included in this patch for the time being because I was planning on 
checking for duplicate hash bucket schemas when constructing partition key 
ranges to avoid building duplicate ranges.


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition.cc@1149
PS1, Line 1149:  rh
> nit: spacing
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc
File src/kudu/common/partition_pruner-test.cc:

http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1057
PS1, Line 1057:                                table_hash_schema,
              :                              PartitionSchemaPB* 
partition_schema_pb) {
              :   auto range_schema = 
partition_schema_pb->mutable_range_schema();
              :   for (const auto& range_column : range_columns) {
              :     range_schema->add_columns()->set_name(range_column);
              :   }
              :   for (const auto& hash_schema : table_hash_schema) {
              :     auto hash_schema_component = 
partition_schema_pb->add_hash_bucket_schemas();
              :     for (const auto& hash_schema_columns : get<0>(hash_schema)) 
{
              :       
hash_schema_component->add_columns()->set_name(hash_schema_columns);
              :     }
              :     hash_schema_component->set_num_buckets(get<1>(hash_schema));
              :     hash_schema_component->set_seed(get<2>(hash_schema));
              :   }
              : }
              :
              : void AddRangePartitionWithSchema(const Schema& schema,
              :                                  const vector<pair<string, 
string>>& lower_string_cols,
              :                                  const vector<pair<string, 
string>>& upper_string_cols,
              :                                  const vector<pair<string, 
int8_t>>& lower_int_cols,
              :                                  const vector<pair<string, 
int8_t>>& upper_int_cols,
              :                                  const 
vector<tuple<vector<string>, int32_t, uint32_t>>&
              :                                    hash_schemas,
              :                                  vector<pair<K
> nit: just looking for ways to reuse some code. May be worth defining a help
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1095
PS1, Line 1095:   for (const auto& bound : upper_int_cols) {
              :     ASSERT_OK(upper.SetInt8(bound.first, boun
> nit: to avoid ambiguity in reading this, could we perhaps uppercase the col
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1147
PS1, Line 1147:   {
> warning: unused variable 'range_hash_component' [clang-diagnostic-unused-va
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1155
PS1, Line 1155:     vector<string> columns = {"A"};
              :     vector<string> columns_1 = {"B"};
              :     AddRangePartitionWithSchema(schema, {std::make_pair("C", 
"d")}, {std::make_pair("C", "f")},{},
              :                                 {}, {make_tuple(columns, 2, 0), 
make_tuple(columns_1, 3, 0)},
              :                                 &bounds, &range_hash_schemas, 
&pb);
              :   }
              :
              :   // [(_, _, h), (_, _, j))
              :   {
              :     AddRangePartitionWithSchema(schema, {std::make_pair("C", 
"h")}, {std::make_pair("C", "j")},
              :                                 {}, {}, {}, &bounds, 
&range_hash_schemas, &pb);
              :   }
              :
> nit: may be worth adding some helper to modify a partition schema? Maybe so
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1183
PS1, Line 1183:   // Applies the specified predicates to a scan and checks that 
the expected
              :   // number of partitions are pruned.
> Just curious, is setting these different from setting range predicates on t
There are three ways to set bounds on a scan. The first way being adding 
predicates (e.g. range/equality). The second way being setting lower and upper 
bound primary keys. The last way is setting lower and upper bound partition 
keys as shown in this test. However, setting partition keys is considered 
unstable and is only meant for internal use.


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1211
PS1, Line 1211:
> nit: spacing
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1312
PS1, Line 1312:                   Co
> nit: move this down closer to where it's used. That way it's less distracti
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1337
PS1, Line 1337:
> warning: Value stored to 'range_hash_component' during its initialization i
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1356
PS1, Line 1356:                     optional<tuple<int8_t, int8_t, int8_t>> 
upper,
> warning: Value stored to 'range_hash_component' during its initialization i
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1375
PS1, Line 1375:       CHECK_OK(upper_bound.SetInt8("b", get<1>(*upper)));
> warning: unused variable 'range_hash_component' [clang-diagnostic-unused-va
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1506
PS1, Line 1506:         7,  7);
> warning: Value stored to 'range_hash_component' during its initialization i
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.h
File src/kudu/common/partition_pruner.h:

http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.h@66
PS1, Line 66: s true
> nit: num_ranges ?
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.h@99
PS1, Line 99:
            :   // A vector of a pair of lower and upper range bounds mapped to 
a
            :   // reverse sorted set of partition key ranges. Each
> +1
The tuple type was already in use before my changes, I added a pair of strings 
for the range bounds mostly because it's slightly easier to access the contents 
of the pair. If there's a preference for pair or tuple, I wouldn't mind only 
using one type. That being said, the distinction between the range bounds and 
partition key ranges will no longer exist if we conform to one type for both. 
Another alternative is using a struct but that will also be a bit redundant 
since for both range bounds and partition key ranges it's a pair of strings.


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc
File src/kudu/common/partition_pruner.cc:

http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@412
PS1, Line 412:   // the lower and upper bounds specified by the scan.
> nit: either through a comment or a variable name change (e.g. scan_range_lo
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@433
PS1, Line 433: as if t
> nit: would be nice to add a comment describing this block, e.g. something l
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@435
PS1, Line 435: artition_sch
> nit: We seem to be conflating the plurality of hash_schema and hash_schemas
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@434
PS1, Line 434:  the range bounds specified by the scan.
             :   if (!partition_sch
> nit: when splitting a long declaration, I find it more convenient as a read
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@438
PS1, Line 438:       const auto* hash_schemas = range.hash_schemas.empty() ?
             :           &partition_schema.hash_bucket_schemas_ : 
&range.hash_schemas;
             :       // Both lower and upper bounds are unbounded.
             :       i
> nit: we could simplify this as
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@444
PS1, Line 444:
> Why can't we use emplace_back for 'hash_schemas'?
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@473
PS1, Line 473:       vector<PartitionKeyRange> partition_key_ranges(1);
             :       if (scan_range_lower_bound.empty() && 
scan_range_upper_bound.empty()) {
             :         ConstructPartitionKeyRanges(schema, partition_schema, 
scan_spec,
             :                                     *hash_schema, 
make_pair(range_bounds[i].first,
             :
> I don't quite follow this case. If the scan's lower bound is below the rang
Imagine there are 3 range bounds: (0, 10), (10, 20), (20, 30). If the scan's 
lower range bound is 5 and the scan's upper range bound is 25, then in order to 
add the hash schemas from the bound (10, 20) we need the last if statement.

In regards to your assumption about the scan's upper bound being below the 
range's lower bound, it only falls to the last if statement if the scan's upper 
bound isn't fully contained within the current range we are iterating through.

All that being said, this case is merged with the new simplification of the 
code.


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@463
PS1, Line 463:         range_bounds.emplace_back(range.lower, range.upper);
             :         hash_schemas_per_range.emplace_back(hash_schemas);
             :       }
             :     }
             :     partition_key_ranges_.resize(hash_schemas_per_range.size());
             :     DCHECK_EQ(range_bounds.size(), 
hash_schemas_per_range.size());
             :     // Constructs partition key ranges from the ranges and their 
respective hash schemas
             :     // that fell within the scan's bounds.
             :     for (int i = 0; i < hash_schemas_per_range.size(); i++) {
             :       const auto& hash_schema = hash_schemas_per_range[i];
             :       vector<PartitionKeyRange> partition_key_ranges(1);
             :       if (scan_range_lower_bound.empty() && 
scan_range_upper_bound.empty()) {
             :         ConstructPartitionKeyRanges(schema, partition_schema, 
scan_spec,
             :                                     *hash_schema, 
make_pair(range_bounds[i].first,
             :
> Maybe I'm missing something, but it seems like this code should just be che
Thanks for pointing this out, the simplified code I ended up writing has 
different equality checks than the one you presented, but it led me down the 
right path.


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@502
PS1, Line 502:   if (!scan_spe
> nit: IMO these step counts aren't super useful and are a bit confusing now
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@518
PS1, Line 518: er_bound.empty()) {
> nit: this looks a bit strange to re-use test-only method as a base for regu
I agree, the original motivation behind the method seems to be for tests only, 
but I don't see why we couldn't use it for this method. The alternative is just 
duplicating the code from `NumRangesRemainingForTests()` for this method.


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@523
PS1, Line 523: _key_range : partition_key_range
> Could it happen that partition_key_ranges_ is empty?  Maybe, add correspond
There's a CHECK() in the line above that does that.


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@524
PS1, Line 524: nge = partition_key_range.second.
> nit: BTW, there is std::vector::back() method
Done


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@550
PS1, Line 550: _bound(partition_key_range.s
> Would it be possible to use partition_key_start() without wrapping it into
Both the `range_key_start()` and `range_key_end()` methods within the Partition 
class return a Slice. Maybe I could use the data wrapped within the Slice 
directly. Any reason why we don't want to use a Slice here?


http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@586
PS1, Line 586: pace kudu
             :
> nit: individual values that span more than one line should be indented such
Done



--
To view, visit http://gerrit.cloudera.org:8080/17643
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I05c37495430f61a2c6f6012c72251138aee465b7
Gerrit-Change-Number: 17643
Gerrit-PatchSet: 2
Gerrit-Owner: Mahesh Reddy <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <[email protected]>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 15 Jul 2021 05:46:09 +0000
Gerrit-HasComments: Yes

[kudu-CR] [pruning] wip KUDU-2671: Modifies pruning logic to include per range hash schemas.

Reply via email to