Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17643 )
Change subject: [pruning] wip KUDU-2671: Modifies pruning logic to include per range hash schemas. ...................................................................... Patch Set 2: (33 comments) http://gerrit.cloudera.org:8080/#/c/17643/1//COMMIT_MSG Commit Message: PS1: > nit: it would be nice to reformat this commit description to follow the gui Took a quick look, but what specifically did you think needs to be changed about the commit message's format? I made the lengths of the lines shorter. http://gerrit.cloudera.org:8080/#/c/17643/1//COMMIT_MSG@11 PS1, Line 11: Currently, when the ranges are constru > nit: same table-wide set of hash bucket schemas is used for every range Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition.h File src/kudu/common/partition.h: http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition.h@285 PS1, Line 285: lhs_hb_sch > nit: if the other parameter named 'rhs_hb_schemas', maybe name this one as Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition.h@286 PS1, Line 286: > nit: spacing Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition.cc File src/kudu/common/partition.cc: http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition.cc@1108 PS1, Line 1108: : bool PartitionSchema::EqualHashBucketSchemas(const HashBucketSchemas& hb_schemas, : const HashBucketSchemas& rhs_hb_schemas) { > +1 It's included in this patch for the time being because I was planning on checking for duplicate hash bucket schemas when constructing partition key ranges to avoid building duplicate ranges. http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition.cc@1149 PS1, Line 1149: rh > nit: spacing Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc File src/kudu/common/partition_pruner-test.cc: http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1057 PS1, Line 1057: table_hash_schema, : PartitionSchemaPB* partition_schema_pb) { : auto range_schema = partition_schema_pb->mutable_range_schema(); : for (const auto& range_column : range_columns) { : range_schema->add_columns()->set_name(range_column); : } : for (const auto& hash_schema : table_hash_schema) { : auto hash_schema_component = partition_schema_pb->add_hash_bucket_schemas(); : for (const auto& hash_schema_columns : get<0>(hash_schema)) { : hash_schema_component->add_columns()->set_name(hash_schema_columns); : } : hash_schema_component->set_num_buckets(get<1>(hash_schema)); : hash_schema_component->set_seed(get<2>(hash_schema)); : } : } : : void AddRangePartitionWithSchema(const Schema& schema, : const vector<pair<string, string>>& lower_string_cols, : const vector<pair<string, string>>& upper_string_cols, : const vector<pair<string, int8_t>>& lower_int_cols, : const vector<pair<string, int8_t>>& upper_int_cols, : const vector<tuple<vector<string>, int32_t, uint32_t>>& : hash_schemas, : vector<pair<K > nit: just looking for ways to reuse some code. May be worth defining a help Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1095 PS1, Line 1095: for (const auto& bound : upper_int_cols) { : ASSERT_OK(upper.SetInt8(bound.first, boun > nit: to avoid ambiguity in reading this, could we perhaps uppercase the col Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1147 PS1, Line 1147: { > warning: unused variable 'range_hash_component' [clang-diagnostic-unused-va Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1155 PS1, Line 1155: vector<string> columns = {"A"}; : vector<string> columns_1 = {"B"}; : AddRangePartitionWithSchema(schema, {std::make_pair("C", "d")}, {std::make_pair("C", "f")},{}, : {}, {make_tuple(columns, 2, 0), make_tuple(columns_1, 3, 0)}, : &bounds, &range_hash_schemas, &pb); : } : : // [(_, _, h), (_, _, j)) : { : AddRangePartitionWithSchema(schema, {std::make_pair("C", "h")}, {std::make_pair("C", "j")}, : {}, {}, {}, &bounds, &range_hash_schemas, &pb); : } : > nit: may be worth adding some helper to modify a partition schema? Maybe so Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1183 PS1, Line 1183: // Applies the specified predicates to a scan and checks that the expected : // number of partitions are pruned. > Just curious, is setting these different from setting range predicates on t There are three ways to set bounds on a scan. The first way being adding predicates (e.g. range/equality). The second way being setting lower and upper bound primary keys. The last way is setting lower and upper bound partition keys as shown in this test. However, setting partition keys is considered unstable and is only meant for internal use. http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1211 PS1, Line 1211: > nit: spacing Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1312 PS1, Line 1312: Co > nit: move this down closer to where it's used. That way it's less distracti Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1337 PS1, Line 1337: > warning: Value stored to 'range_hash_component' during its initialization i Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1356 PS1, Line 1356: optional<tuple<int8_t, int8_t, int8_t>> upper, > warning: Value stored to 'range_hash_component' during its initialization i Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1375 PS1, Line 1375: CHECK_OK(upper_bound.SetInt8("b", get<1>(*upper))); > warning: unused variable 'range_hash_component' [clang-diagnostic-unused-va Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner-test.cc@1506 PS1, Line 1506: 7, 7); > warning: Value stored to 'range_hash_component' during its initialization i Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.h File src/kudu/common/partition_pruner.h: http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.h@66 PS1, Line 66: s true > nit: num_ranges ? Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.h@99 PS1, Line 99: : // A vector of a pair of lower and upper range bounds mapped to a : // reverse sorted set of partition key ranges. Each > +1 The tuple type was already in use before my changes, I added a pair of strings for the range bounds mostly because it's slightly easier to access the contents of the pair. If there's a preference for pair or tuple, I wouldn't mind only using one type. That being said, the distinction between the range bounds and partition key ranges will no longer exist if we conform to one type for both. Another alternative is using a struct but that will also be a bit redundant since for both range bounds and partition key ranges it's a pair of strings. http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc File src/kudu/common/partition_pruner.cc: http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@412 PS1, Line 412: // the lower and upper bounds specified by the scan. > nit: either through a comment or a variable name change (e.g. scan_range_lo Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@433 PS1, Line 433: as if t > nit: would be nice to add a comment describing this block, e.g. something l Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@435 PS1, Line 435: artition_sch > nit: We seem to be conflating the plurality of hash_schema and hash_schemas Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@434 PS1, Line 434: the range bounds specified by the scan. : if (!partition_sch > nit: when splitting a long declaration, I find it more convenient as a read Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@438 PS1, Line 438: const auto* hash_schemas = range.hash_schemas.empty() ? : &partition_schema.hash_bucket_schemas_ : &range.hash_schemas; : // Both lower and upper bounds are unbounded. : i > nit: we could simplify this as Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@444 PS1, Line 444: > Why can't we use emplace_back for 'hash_schemas'? Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@473 PS1, Line 473: vector<PartitionKeyRange> partition_key_ranges(1); : if (scan_range_lower_bound.empty() && scan_range_upper_bound.empty()) { : ConstructPartitionKeyRanges(schema, partition_schema, scan_spec, : *hash_schema, make_pair(range_bounds[i].first, : > I don't quite follow this case. If the scan's lower bound is below the rang Imagine there are 3 range bounds: (0, 10), (10, 20), (20, 30). If the scan's lower range bound is 5 and the scan's upper range bound is 25, then in order to add the hash schemas from the bound (10, 20) we need the last if statement. In regards to your assumption about the scan's upper bound being below the range's lower bound, it only falls to the last if statement if the scan's upper bound isn't fully contained within the current range we are iterating through. All that being said, this case is merged with the new simplification of the code. http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@463 PS1, Line 463: range_bounds.emplace_back(range.lower, range.upper); : hash_schemas_per_range.emplace_back(hash_schemas); : } : } : partition_key_ranges_.resize(hash_schemas_per_range.size()); : DCHECK_EQ(range_bounds.size(), hash_schemas_per_range.size()); : // Constructs partition key ranges from the ranges and their respective hash schemas : // that fell within the scan's bounds. : for (int i = 0; i < hash_schemas_per_range.size(); i++) { : const auto& hash_schema = hash_schemas_per_range[i]; : vector<PartitionKeyRange> partition_key_ranges(1); : if (scan_range_lower_bound.empty() && scan_range_upper_bound.empty()) { : ConstructPartitionKeyRanges(schema, partition_schema, scan_spec, : *hash_schema, make_pair(range_bounds[i].first, : > Maybe I'm missing something, but it seems like this code should just be che Thanks for pointing this out, the simplified code I ended up writing has different equality checks than the one you presented, but it led me down the right path. http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@502 PS1, Line 502: if (!scan_spe > nit: IMO these step counts aren't super useful and are a bit confusing now Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@518 PS1, Line 518: er_bound.empty()) { > nit: this looks a bit strange to re-use test-only method as a base for regu I agree, the original motivation behind the method seems to be for tests only, but I don't see why we couldn't use it for this method. The alternative is just duplicating the code from `NumRangesRemainingForTests()` for this method. http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@523 PS1, Line 523: _key_range : partition_key_range > Could it happen that partition_key_ranges_ is empty? Maybe, add correspond There's a CHECK() in the line above that does that. http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@524 PS1, Line 524: nge = partition_key_range.second. > nit: BTW, there is std::vector::back() method Done http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@550 PS1, Line 550: _bound(partition_key_range.s > Would it be possible to use partition_key_start() without wrapping it into Both the `range_key_start()` and `range_key_end()` methods within the Partition class return a Slice. Maybe I could use the data wrapped within the Slice directly. Any reason why we don't want to use a Slice here? http://gerrit.cloudera.org:8080/#/c/17643/1/src/kudu/common/partition_pruner.cc@586 PS1, Line 586: pace kudu : > nit: individual values that span more than one line should be indented such Done -- To view, visit http://gerrit.cloudera.org:8080/17643 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I05c37495430f61a2c6f6012c72251138aee465b7 Gerrit-Change-Number: 17643 Gerrit-PatchSet: 2 Gerrit-Owner: Mahesh Reddy <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mahesh Reddy <[email protected]> Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Thu, 15 Jul 2021 05:46:09 +0000 Gerrit-HasComments: Yes
