Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/17544#discussion_r110060504
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/StarSchemaDetection.scala
---
@@ -0,0 +1,351 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.annotation.tailrec
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.planning.PhysicalOperation
+import org.apache.spark.sql.catalyst.plans._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * Encapsulates star-schema detection logic.
+ */
+case class StarSchemaDetection(conf: SQLConf) extends PredicateHelper {
+
+ /**
+ * Star schema consists of one or more fact tables referencing a number
of dimension
+ * tables. In general, star-schema joins are detected using the
following conditions:
+ * 1. Informational RI constraints (reliable detection)
+ * + Dimension contains a primary key that is being joined to the fact
table.
+ * + Fact table contains foreign keys referencing multiple dimension
tables.
+ * 2. Cardinality based heuristics
+ * + Usually, the table with the highest cardinality is the fact table.
+ * + Table being joined with the most number of tables is the fact table.
+ *
+ * To detect star joins, the algorithm uses a combination of the above
two conditions.
+ * The fact table is chosen based on the cardinality heuristics, and the
dimension
+ * tables are chosen based on the RI constraints. A star join will
consist of the largest
+ * fact table joined with the dimension tables on their primary keys. To
detect that a
+ * column is a primary key, the algorithm uses table and column
statistics.
+ *
+ * The algorithm currently returns only the star join with the largest
fact table.
+ * Choosing the largest fact table on the driving arm to avoid large
inners is in
+ * general a good heuristic. This restriction will be lifted to observe
multiple
+ * star joins.
+ *
+ * The highlights of the algorithm are the following:
+ *
+ * Given a set of joined tables/plans, the algorithm first verifies if
they are eligible
+ * for star join detection. An eligible plan is a base table access with
valid statistics.
+ * A base table access represents Project or Filter operators above a
LeafNode. Conservatively,
+ * the algorithm only considers base table access as part of a star join
since they provide
+ * reliable statistics. This restriction can be lifted with the CBO
enablement by default.
+ *
+ * If some of the plans are not base table access, or statistics are not
available, the algorithm
+ * returns an empty star join plan since, in the absence of statistics,
it cannot make
+ * good planning decisions. Otherwise, the algorithm finds the table
with the largest cardinality
+ * (number of rows), which is assumed to be a fact table.
+ *
+ * Next, it computes the set of dimension tables for the current fact
table. A dimension table
+ * is assumed to be in a RI relationship with a fact table. To infer
column uniqueness,
+ * the algorithm compares the number of distinct values with the total
number of rows in the
+ * table. If their relative difference is within certain limits (i.e.
ndvMaxError * 2, adjusted
+ * based on 1TB TPC-DS data), the column is assumed to be unique.
+ */
+ def findStarJoins(
+ input: Seq[LogicalPlan],
+ conditions: Seq[Expression]): Seq[LogicalPlan] = {
+
+ val emptyStarJoinPlan = Seq.empty[LogicalPlan]
+
+ if (!conf.starSchemaDetection || input.size < 2) {
+ emptyStarJoinPlan
+ } else {
+ // Find if the input plans are eligible for star join detection.
+ // An eligible plan is a base table access with valid statistics.
+ val foundEligibleJoin = input.forall {
+ case PhysicalOperation(_, _, t: LeafNode) if
t.stats(conf).rowCount.isDefined => true
+ case _ => false
+ }
+
+ if (!foundEligibleJoin) {
+ // Some plans don't have stats or are complex plans.
Conservatively,
+ // return an empty star join. This restriction can be lifted
+ // once statistics are propagated in the plan.
+ emptyStarJoinPlan
+ } else {
+ // Find the fact table using cardinality based heuristics i.e.
+ // the table with the largest number of rows.
+ val sortedFactTables = input.map { plan =>
+ TableAccessCardinality(plan, getTableAccessCardinality(plan))
+ }.collect { case t @ TableAccessCardinality(_, Some(_)) =>
+ t
+ }.sortBy(_.size)(implicitly[Ordering[Option[BigInt]]].reverse)
+
+ sortedFactTables match {
+ case Nil =>
+ emptyStarJoinPlan
+ case table1 :: table2 :: _
+ if table2.size.get.toDouble > conf.starSchemaFTRatio *
table1.size.get.toDouble =>
+ // If the top largest tables have comparable number of rows,
return an empty star plan.
+ // This restriction will be lifted when the algorithm is
generalized
+ // to return multiple star plans.
+ emptyStarJoinPlan
+ case TableAccessCardinality(factTable, _) :: rest =>
+ // Find the fact table joins.
+ val allFactJoins = rest.collect { case
TableAccessCardinality(plan, _)
+ if findJoinConditions(factTable, plan, conditions).nonEmpty
=>
+ plan
+ }
+
+ // Find the corresponding join conditions.
+ val allFactJoinCond = allFactJoins.flatMap { plan =>
+ val joinCond = findJoinConditions(factTable, plan,
conditions)
+ joinCond
+ }
+
+ // Verify if the join columns have valid statistics.
+ // Allow any relational comparison between the tables. Later
+ // we will heuristically choose a subset of equi-join
+ // tables.
+ val areStatsAvailable = allFactJoins.forall { dimTable =>
+ allFactJoinCond.exists {
+ case BinaryComparison(lhs: AttributeReference, rhs:
AttributeReference) =>
+ val dimCol = if (dimTable.outputSet.contains(lhs)) lhs
else rhs
+ val factCol = if (factTable.outputSet.contains(lhs)) lhs
else rhs
+ hasStatistics(dimCol, dimTable) &&
hasStatistics(factCol, factTable)
+ case _ => false
+ }
+ }
+
+ if (!areStatsAvailable) {
+ emptyStarJoinPlan
+ } else {
+ // Find the subset of dimension tables. A dimension table is
assumed to be in a
+ // RI relationship with the fact table. Only consider
equi-joins
+ // between a fact and a dimension table to avoid expanding
joins.
+ val eligibleDimPlans = allFactJoins.filter { dimTable =>
+ allFactJoinCond.exists {
+ case cond @ Equality(lhs: AttributeReference, rhs:
AttributeReference) =>
+ val dimCol = if (dimTable.outputSet.contains(lhs)) lhs
else rhs
+ isUnique(dimCol, dimTable)
+ case _ => false
+ }
+ }
+
+ if (eligibleDimPlans.isEmpty || eligibleDimPlans.size < 2) {
--- End diff --
uh, you move it to here.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]