Ioana Delaney created SPARK-19842:
-------------------------------------
Summary: Informational Referential Integrity Constraints Support in Spark
Key: SPARK-19842
URL: https://issues.apache.org/jira/browse/SPARK-19842
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.2.0
Reporter: Ioana Delaney
*Informational Referential Integrity Constraints Support in Spark*
This work proposes support for _informational primary key_ and _foreign key
(referential integrity) constraints_ in Spark. The main purpose is to open up
an area of query optimization techniques that rely on the semantics of
referential integrity constraints.
An _informational_ or _statistical constraint_ is a constraint, such as a
_unique_, _primary key_, _foreign key_, or _check constraint_, that Spark can
use to improve query performance. Informational constraints are not enforced
by the Spark SQL engine; rather, Catalyst uses them to optimize query
processing. They provide semantic information that allows Catalyst to rewrite
queries to eliminate joins, push down aggregates, remove unnecessary Distinct
operations, and perform a number of other optimizations. Informational
constraints are primarily targeted at applications that load and analyze data
that originated from a data warehouse. For such applications, the conditions
for a given constraint are known to be true, so the constraint does not need to
be enforced during data load operations.
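To make the join-elimination case concrete, here is a minimal sketch in plain Python. It does not use any Spark or Catalyst API, and the table names, column names, and helper functions are all hypothetical. It only illustrates why a join against the primary-key side can be dropped when the query reads no columns from that side and the constraints are known to hold.

```python
# Toy data: customers.id is assumed to be a primary key, and
# orders.cust_id is assumed to be a non-null foreign key referencing it.
customers = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
]
orders = [
    {"order_id": 10, "cust_id": 1, "amount": 5.0},
    {"order_id": 11, "cust_id": 1, "amount": 7.5},
    {"order_id": 12, "cust_id": 2, "amount": 3.0},
]


def constraints_hold(parent, child, pk, fk):
    """Return True if pk is unique in parent and every fk value in child
    (NULL/None values are rejected too) matches some pk value."""
    keys = [row[pk] for row in parent]
    key_set = set(keys)
    unique = len(keys) == len(key_set)
    referential = all(row[fk] in key_set for row in child)
    return unique and referential


def total_with_join(parent, child, pk, fk):
    """Equivalent of:
    SELECT SUM(c.amount) FROM child c JOIN parent p ON c.fk = p.pk"""
    return sum(c["amount"] for c in child for p in parent if c[fk] == p[pk])


def total_without_join(child):
    """Rewritten query with the join removed:
    SELECT SUM(amount) FROM child"""
    return sum(c["amount"] for c in child)


# An orphan row violates the foreign key, so the rewrite is no longer safe.
orders_with_orphan = orders + [{"order_id": 13, "cust_id": 99, "amount": 1.0}]
```

With the constraints satisfied, `total_with_join(customers, orders, "id", "cust_id")` and `total_without_join(orders)` return the same value. With `orders_with_orphan`, the two results differ, which is why an optimizer may apply this rewrite only when the constraint is declared and trusted to be true.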
The attached document covers constraint definition, metastore storage,
constraint validation, and maintenance. The document shows many examples of
query performance improvements that utilize referential integrity constraints
and can be implemented in Spark.