Github user ioana-delaney commented on a diff in the pull request:
https://github.com/apache/spark/pull/15363#discussion_r105023746
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
---
@@ -20,19 +20,340 @@ package org.apache.spark.sql.catalyst.optimizer
import scala.annotation.tailrec
import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import org.apache.spark.sql.catalyst.planning.{BaseTableAccess, ExtractFiltersAndInnerJoins}
import org.apache.spark.sql.catalyst.plans._
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class DetectStarSchemaJoin(conf: CatalystConf) extends PredicateHelper {
+
+ /**
+ * Star schema consists of one or more fact tables referencing a number of dimension
+ * tables. In general, star-schema joins are detected using the following conditions:
+ * 1. Informational RI constraints (reliable detection)
+ * + Dimension contains a primary key that is being joined to the fact table.
+ * + Fact table contains foreign keys referencing multiple dimension tables.
+ * 2. Cardinality based heuristics
+ * + Usually, the table with the highest cardinality is the fact table.
+ * + Table being joined with the most number of tables is the fact table.
+ *
+ * To detect star joins, the algorithm uses a combination of the above two conditions.
+ * The fact table is chosen based on the cardinality heuristics, and the dimension
+ * tables are chosen based on the RI constraints. A star join will consist of the largest
+ * fact table joined with the dimension tables on their primary keys. To detect that a
+ * column is a primary key, the algorithm uses table and column statistics.
+ *
+ * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only
--- End diff --
@hvanhovell A single star join will indeed be represented by a left-deep tree.
But if a query has more than one star-schema join, ideally you want to plan
them as a bushy tree, i.e. (star-join-1) Join (star-join-2).
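
To make the shape difference concrete, here is a minimal sketch. It assumes a toy plan model: `Plan`, `Table`, `Join`, `JoinShapes`, and the table names are all hypothetical and are not Catalyst classes or anything from this PR. Each star join is built as a left-deep chain, and joining two such chains yields a tree that is no longer left-deep.

```scala
// Hypothetical mini plan model for illustration only; NOT Catalyst's LogicalPlan.
sealed trait Plan
final case class Table(name: String) extends Plan
final case class Join(left: Plan, right: Plan) extends Plan

object JoinShapes {
  // Fold the tables into a left-deep chain: ((t1 Join t2) Join t3) ...
  def leftDeep(tables: Seq[String]): Plan = {
    require(tables.nonEmpty, "need at least one table")
    tables.tail.foldLeft[Plan](Table(tables.head))((acc, t) => Join(acc, Table(t)))
  }

  // A plan is left-deep when the right input of every join is a base table.
  def isLeftDeep(plan: Plan): Boolean = plan match {
    case Table(_)             => true
    case Join(left, Table(_)) => isLeftDeep(left)
    case Join(_, _)           => false
  }

  // Render the join order with explicit parentheses.
  def render(plan: Plan): String = plan match {
    case Table(n)   => n
    case Join(l, r) => s"(${render(l)} Join ${render(r)})"
  }
}

object Example {
  // Two star joins, each a fact table followed by its dimensions (made-up names).
  val star1: Plan = JoinShapes.leftDeep(Seq("fact1", "dim1", "dim2"))
  val star2: Plan = JoinShapes.leftDeep(Seq("fact2", "dim3", "dim4"))

  // Joining the two star joins gives a bushy tree: the right input is itself a join.
  val bushy: Plan = Join(star1, star2)
}
```

In this sketch `JoinShapes.isLeftDeep(Example.star1)` holds, while `JoinShapes.isLeftDeep(Example.bushy)` does not, because the right child of the top join is itself a join rather than a base table.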