Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16674 )

Change subject: KUDU-1644 hash-partition based in-list predicate optimization
......................................................................


Patch Set 8:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/16674/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16674/7//COMMIT_MSG@2
PS7, Line 2:   ningw <[email protected]>
           : AuthorDate:
> this patch is target on
nit: could you add a TODO comment in PruneHashForInlistIfPossible mentioning 
this? I agree this would be a good improvement to make, and it'd be nice to 
have a pointer in the code both explaining the limitations of the code today, 
and ideas for improvement.


http://gerrit.cloudera.org:8080/#/c/16674/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16674/8//COMMIT_MSG@31
PS8, Line 31:
            : After:
            : Tablet 1: id in (0,3,6,9)
            : Tablet 2: id in (1,4,7,10)
            : Tablet 3: id in (2,5,8)
Making sure my understanding is correct here -- this doesn't reduce the number 
of rows sent back by the servers, but:
- it does reduce the CPU used when evaluating the in-list predicates since the 
predicates themselves are smaller
- it also allows us to short-circuit immediately if the in-list predicate only 
contains values that are not in a given tablet


http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec-test.cc
File src/kudu/common/scan_spec-test.cc:

http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec-test.cc@473
PS8, Line 473: ASSERT_EQ(spec_p1.IsInListPredicateValuesEmpty(), false);
nit: you can use ASSERT_TRUE and ASSERT_FALSE for these


http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec-test.cc@503
PS8, Line 503: columnPB
nit: We use snake_case for variable names in our C++ code. Same elsewhere


http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec-test.cc@518
PS8, Line 518: parition
nit: "partition", same elsewhere


http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec-test.cc@553
PS8, Line 553: / Test that hash(a, b), hash(c) InList predicates.
             : // Neither a or b InList can be pruned.
             : // c InList should be pruned.
Could you also add a case for hash(a), hash(b)?


http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec-test.cc@563
PS8, Line 563:
             :   PartitionSchema partition_schema;
             :   PartitionSchemaPB partition_schemaPB;
             :   auto* hashPB = partition_schemaPB.add_hash_bucket_schemas();
             :   hashPB->set_num_buckets(3);
             :   hashPB->set_seed(0);
             :   auto* columnPB_1_1 = hashPB->add_columns();
             :   int key_column_id_1_1 = schema.find_column("a");
             :   columnPB_1_1->set_id(key_column_id_1_1);
             :   columnPB_1_1->set_name("a");
             :
             :   auto* columnPB_1_2 = hashPB->add_columns();
             :   int key_column_id_1_2 = schema.find_column("b");
             :   columnPB_1_2->set_id(key_column_id_1_2);
             :   columnPB_1_2->set_name("b");
             :
             :   auto* hashPB_2_1 = 
partition_schemaPB.add_hash_bucket_schemas();
             :   hashPB_2_1->set_num_buckets(3);
             :   hashPB_2_1->set_seed(0);
             :   auto* columnPB_2_1 = hashPB_2_1->add_columns();
             :   int key_column_id_2_1 = schema.find_column("c");
             :   columnPB_2_1->set_id(key_column_id_2_1);
             :   columnPB_2_1->set_name("c");
             :
             :   ASSERT_OK(PartitionSchema::FromPB(partition_schemaPB, schema, 
&partition_schema));
nit: seems like this code is reused a bunch in these tests. Maybe define a 
function in an anonymous namespace like:

namespace {

PartitionSchema GeneratePartitionSchema(const vector<pair<vector<string>, int>> 
hash_partitions, ...) {
  PartitionsSchemaPB partition_schema_pb;
  for (const auto& col_names_and_num_buckets : hash_partitions) {
    auto* hash_pb = partition_schema_pb.add_hash_bucket_schemas();
    hash_pb->set_num_buckets(col_names_and_num_buckets.second);
    for (const auto& col_name : col_names_and_num_buckets.first) {
      hash_pb->(set fields...)
    }
  }
  PartitionSchema partition_schema;
  CHECK_OK(FromPB(partition_schema_pb, &partition_schema));
  return partition_schema;
}

} // anonymous namespace

That way, these tests can just call

 const auto partition_schema = GeneratePartitionSchema({ make_pair({ "a", "b" 
}, 3), make_pair({ "c" }, 1) });


http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec.cc
File src/kudu/common/scan_spec.cc:

http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec.cc@88
PS8, Line 88:                   return predicate.second.predicate_type() == 
PredicateType::None;
Why shouldn't we put the InList empty check here?


http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec.cc@191
PS8, Line 191:     if (predicate.predicate_type() == PredicateType::InList) {
We should be able to push this to the front of the loop and avoid a 
FindColumn() call, e.g.

for (auto& p : predicates_) {
  if (predicate.predicate_type() != PredicateType::InList) continue;
  ...
}


http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec.cc@193
PS8, Line 193: col_name,
nit: we've already found the column index -- could we pass 'idx' or the column 
ID here and avoid an extra FindColumn()?


http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec.cc@193
PS8, Line 193: IsSingleColumnHashParition
nit: typo in "Partition"


http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/common/scan_spec.cc@205
PS8, Line 205: Status s = Status::OK();
             :           s = partial_row.Set(idx, reinterpret_cast<const 
uint8_t*>(value));
nit: can merge these lines


http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/tserver/tablet_service.cc@2667
PS8, Line 2667: // However, it may crash in coming 'OptimizeScan' call to use 
empty
              :     // predicate value, here could protects special case with 
minor cost during
              :     // setup stage.
If we pushed the emptiness check into CanShortCircuit(), would that avoid the 
crash?


http://gerrit.cloudera.org:8080/#/c/16674/8/src/kudu/tserver/tablet_service.cc@2737
PS8, Line 2737:   if (spec.CanShortCircuit()) {
              :     VLOG(1) << "short-circuiting without creating a server-side 
scanner.";
              :     *has_more_results = false;
              :     return Status::OK();
              :   }
Hmm, I'm not sure, but could we move this up to right after OptimizeScan()? 
Seems like otherwise we are doing some work even though we are not going to 
return anything. Or do we need to short circuit _after_ validating the scan 
spec?



--
To view, visit http://gerrit.cloudera.org:8080/16674
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I202001535669a72de7fbb9e766dbc27db48e0aa2
Gerrit-Change-Number: 16674
Gerrit-PatchSet: 8
Gerrit-Owner: wangning <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <[email protected]>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: wangning <[email protected]>
Gerrit-Comment-Date: Fri, 30 Oct 2020 19:55:14 +0000
Gerrit-HasComments: Yes

Reply via email to