windpiger commented on issue #18994: [SPARK-21784][SQL] Adds support for 
defining informational primary key and foreign key constraints using ALTER 
TABLE DDL.
URL: https://github.com/apache/spark/pull/18994#issuecomment-483989274
 
 
   I think constraint support should be designed together with DataSource v2, and it can do more than [SPARK-19842](https://issues.apache.org/jira/browse/SPARK-19842).
   
   Constraints can be used for:
   1. Data integrity (not included in [SPARK-19842](https://issues.apache.org/jira/browse/SPARK-19842)).
   2. Query rewrites in the optimizer to gain performance (not just PK/FK; UNIQUE/NOT NULL constraints are also useful).
   
   For data integrity, we have two scenarios:
   1.1 The data source natively supports data integrity, e.g. MySQL/Oracle. Spark should only call the read/write APIs of such a data source and do nothing about data integrity itself.
   1.2 The data source does not support data integrity, e.g. CSV/JSON/Parquet. Spark can provide data integrity for such a data source the way Hive does (perhaps behind a switch that turns it off), and we can discuss which kinds of constraints to support. For example, Hive supports PK/FK/UNIQUE (DISABLE RELY)/NOT NULL/DEFAULT, and its NOT NULL ENFORCE check is implemented by adding an extra UDF, GenericUDFEnforceNotNullConstraint, to the plan ([HIVE-16605](https://issues.apache.org/jira/browse/HIVE-16605)); a rough sketch of a Spark-side equivalent follows this list.
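
   A minimal sketch of how Spark could do a similar check on the write path for scenario 1.2. `enforceNotNull` is a hypothetical helper, not an existing Spark API, and a real implementation would check per row inside the write plan instead of running a separate scan per column:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Naive Spark-side stand-in for Hive's GenericUDFEnforceNotNullConstraint:
// abort the write if a column declared NOT NULL actually contains nulls.
def enforceNotNull(df: DataFrame, notNullColumns: Seq[String]): DataFrame = {
  val violated = notNullColumns.filter(c => !df.filter(col(c).isNull).isEmpty)
  require(violated.isEmpty,
    s"NOT NULL constraint violated for column(s): ${violated.mkString(", ")}")
  df
}

// e.g. enforceNotNull(df, Seq("b")).write.parquet("/tmp/t")
```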
   
   For optimizer query rewrites:
   2.1 We can add constraint information to the CatalogTable returned by the catalog.getTable API; the optimizer can then use it to rewrite queries (sketched below).
   2.2 If constraint information is not available, we can supply it through SQL hints.
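
   A minimal sketch of the kind of rewrite 2.1 enables, assuming the constraint metadata has already been looked up from CatalogTable (that lookup is the part this proposal would add; `PruneRedundantIsNotNull` and `notNullColumns` are hypothetical names, not existing Spark rules):

```scala
import org.apache.spark.sql.catalyst.expressions.{IsNotNull, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Remove `IS NOT NULL` predicates that are guaranteed by a NOT NULL constraint.
// The leftover `true` literals are cleaned up by the existing simplification rules.
case class PruneRedundantIsNotNull(notNullColumns: Set[String]) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case f @ Filter(condition, child) =>
      val simplified = condition transform {
        case IsNotNull(e) if e.references.nonEmpty &&
            e.references.forall(a => notNullColumns.contains(a.name)) =>
          Literal(true)
      }
      if (simplified fastEquals condition) f else Filter(simplified, child)
  }
}
```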
   
   Putting the above together, we can bring the constraint feature into the DataSource v2 design:
   a) To support 2.1, we can add constraint information to the createTable/alterTable/getTable APIs in this SPIP (https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#).
   b) To support data integrity, we can add a ConstraintSupport mix-in for DataSource v2 (see the sketch after this list):
   - if a data source supports constraints, Spark does nothing when inserting data;
   - if a data source does not support constraints but still wants constraint checks, Spark should do the check the way Hive does (e.g. for NOT NULL, Hive adds an extra UDF, GenericUDFEnforceNotNullConstraint, to the plan);
   - if a data source neither supports constraints nor wants constraint checks, Spark does nothing.
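
   A minimal sketch of what such a mix-in could look like; the trait, its methods, and the Constraint classes below are hypothetical and only illustrate the idea, they are not part of any existing DataSource v2 API:

```scala
// Hypothetical sketch of a ConstraintSupport mix-in for DataSource v2.
sealed trait Constraint
case class NotNull(column: String) extends Constraint
case class PrimaryKey(name: String, columns: Seq[String]) extends Constraint
case class ForeignKey(name: String, columns: Seq[String],
                      refTable: String, refColumns: Seq[String]) extends Constraint

trait ConstraintSupport {
  /** Constraints the source enforces natively; Spark skips checks for these. */
  def nativelyEnforcedConstraints: Seq[Constraint]

  /** Whether Spark should add its own check (e.g. a NOT NULL assert) on write. */
  def wantsSparkSideCheck: Boolean
}
```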
   
   The Hive catalog supports constraints, so we can implement this logic in its createTable/alterTable APIs. We can then use Spark SQL DDL to create a table with constraints, which are stored in the Hive metastore through the Hive catalog APIs.
   For example: CREATE TABLE t(a STRING, b STRING NOT NULL DISABLE, CONSTRAINT pk1 PRIMARY KEY (a) DISABLE) USING parquet;
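
   A rough sketch of how the catalog layer could carry this; the extra `constraints` parameter and the trait name are hypothetical, today ExternalCatalog.createTable has no such argument:

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Hypothetical extension point for the catalog layer. The Hive implementation
// would forward the constraints to the metastore via Hive 2.1's constraint APIs
// rather than encode them in table properties.
trait ConstraintAwareExternalCatalog {
  // Reuses the hypothetical Constraint classes sketched above.
  def createTable(table: CatalogTable,
                  constraints: Seq[Constraint],
                  ignoreIfExists: Boolean): Unit

  def alterTableAddConstraints(db: String, table: String,
                               constraints: Seq[Constraint]): Unit
}
```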
   
   **_As for how to store constraints_**, Hive 2.1 already provides constraint APIs in Hive.java, so we can call them directly from the createTable/alterTable APIs of the Hive catalog; there is no need for Spark to store the constraint information in table properties. There are some concerns in the docs (https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit#heading=h.lnxbz9) about using the Hive 2.1 catalog APIs directly, such as Spark's built-in Hive still being 1.2.1, but the upgrade to Hive 2.3.4 is in progress ([SPARK-23710](https://issues.apache.org/jira/browse/SPARK-23710)).
   
   @cloud-fan @gatorsmile @sureshthalamati @ioana-delaney
   
