Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16674 )
Change subject: KUDU-1644 hash-partition based in-list predicate optimization ...................................................................... Patch Set 8: (14 comments) http://gerrit.cloudera.org:8080/#/c/16674/7//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16674/7//COMMIT_MSG@2 PS7, Line 2: ningw <[email protected]> : AuthorDate: > this patch is target on nit: could you add a TODO comment in PruneHashForInlistIfPossible mentioning this? I agree this would be a good improvement to make, and it'd be nice to have a pointer in the code both explaining the limitations of the code today, and ideas for improvement. http://gerrit.cloudera.org:8080/#/c/16674/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16674/8//COMMIT_MSG@31 PS8, Line 31: : After: : Tablet 1: id in (0,3,6,9) : Tablet 2: id in (1,4,7,10) : Tablet 3: id in (2,5,8) Making sure my understanding is correct here -- this doesn't reduce the number of rows sent back by the servers, but: - it does reduce the CPU used when evaluating the in-list predicates since the predicates themselves are smaller - it also allows us to short-circuit immediately if the in-list predicate only contains values that are not in a given tablet http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec-test.cc File src/kudu/common/scan_spec-test.cc: http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec-test.cc@473 PS8, Line 473: ASSERT_EQ(spec_p1.IsInListPredicateValuesEmpty(), false); nit: you can use ASSERT_TRUE and ASSERT_FALSE for these http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec-test.cc@503 PS8, Line 503: columnPB nit: We use snake_case for variable names in our C++ code. Same elsewhere http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec-test.cc@518 PS8, Line 518: parition nit: "partition", same elsewhere http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec-test.cc@553 PS8, Line 553: / Test that hash(a, b), hash(c) InList predicates. : // Neither a or b InList can be pruned. : // c InList should be pruned. Could you also add a case for hash(a), hash(b)? http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec-test.cc@563 PS8, Line 563: : PartitionSchema partition_schema; : PartitionSchemaPB partition_schemaPB; : auto* hashPB = partition_schemaPB.add_hash_bucket_schemas(); : hashPB->set_num_buckets(3); : hashPB->set_seed(0); : auto* columnPB_1_1 = hashPB->add_columns(); : int key_column_id_1_1 = schema.find_column("a"); : columnPB_1_1->set_id(key_column_id_1_1); : columnPB_1_1->set_name("a"); : : auto* columnPB_1_2 = hashPB->add_columns(); : int key_column_id_1_2 = schema.find_column("b"); : columnPB_1_2->set_id(key_column_id_1_2); : columnPB_1_2->set_name("b"); : : auto* hashPB_2_1 = partition_schemaPB.add_hash_bucket_schemas(); : hashPB_2_1->set_num_buckets(3); : hashPB_2_1->set_seed(0); : auto* columnPB_2_1 = hashPB_2_1->add_columns(); : int key_column_id_2_1 = schema.find_column("c"); : columnPB_2_1->set_id(key_column_id_2_1); : columnPB_2_1->set_name("c"); : : ASSERT_OK(PartitionSchema::FromPB(partition_schemaPB, schema, &partition_schema)); nit: seems like this code is reused a bunch in these tests. Maybe define a function in an anonymous namespace like: namespace { PartitionSchema GeneratePartitionSchema(const vector<pair<vector<string>, int>> hash_partitions, ...) { PartitionsSchemaPB partition_schema_pb; for (const auto& col_names_and_num_buckets : hash_partitions) { auto* hash_pb = partition_schema_pb.add_hash_bucket_schemas(); hash_pb->set_num_buckets(col_names_and_num_buckets.second); for (const auto& col_name : col_names_and_num_buckets.first) { hash_pb->(set fields...) } } PartitionSchema partition_schema; CHECK_OK(FromPB(partition_schema_pb, &partition_schema)); return partition_schema; } } // anonymous namespace That way, these tests can just call const auto partition_schema = GeneratePartitionSchema({ make_pair({ "a", "b" }, 3), make_pair({ "c" }, 1) }); http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec.cc File src/kudu/common/scan_spec.cc: http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec.cc@88 PS8, Line 88: return predicate.second.predicate_type() == PredicateType::None; Why shouldn't we put the InList empty check here? http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec.cc@191 PS8, Line 191: if (predicate.predicate_type() == PredicateType::InList) { We should be able to push this to the front of the loop and avoid a FindColumn() call, e.g. for (auto& p : predicates_) { if (predicate.predicate_type() != PredicateType::InList) continue; ... } http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec.cc@193 PS8, Line 193: col_name, nit: we've already found the column index -- could we pass 'idx' or the column ID here and avoid an extra FindColumn()? http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec.cc@193 PS8, Line 193: IsSingleColumnHashParition nit: typo in "Partition" http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec.cc@205 PS8, Line 205: Status s = Status::OK(); : s = partial_row.Set(idx, reinterpret_cast<const uint8_t*>(value)); nit: can merge these lines http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/tserver/tablet_service.cc File src/kudu/tserver/tablet_service.cc: http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/tserver/tablet_service.cc@2667 PS8, Line 2667: // However, it may crash in coming 'OptimizeScan' call to use empty : // predicate value, here could protects special case with minor cost during : // setup stage. If we pushed the emptiness check into CanShortCircuit(), would that avoid the crash? http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/tserver/tablet_service.cc@2737 PS8, Line 2737: if (spec.CanShortCircuit()) { : VLOG(1) << "short-circuiting without creating a server-side scanner."; : *has_more_results = false; : return Status::OK(); : } Hmm, I'm not sure, but could we move this up to right after OptimizeScan()? Seems like otherwise we are doing some work even though we are not going to return anything. Or do we need to short circuit _after_ validating the scan spec? -- To view, visit http://gerrit.cloudera.org:8080/16674 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I202001535669a72de7fbb9e766dbc27db48e0aa2 Gerrit-Change-Number: 16674 Gerrit-PatchSet: 8 Gerrit-Owner: wangning <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mahesh Reddy <[email protected]> Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: wangning <[email protected]> Gerrit-Comment-Date: Fri, 30 Oct 2020 19:55:14 +0000 Gerrit-HasComments: Yes
