Caveat lector: This is a lengthy email but I have put some thought into
it so it could be rewarding.
I've been lately thinking about constraint enforcement.
Currently there is a logical problem concerning the CHECK constraint.
At the moment constraints belong to either a column or a table.
With UNIQUE or FOREIGN KEY this is not a problem because they naturally
operate on a single table (well foreign key does have it's repercussions
on the referred table, too).
But since CHECK constraint supports arbitrary expressions (which is a
good thing)
it doesn't just belong to a single column or table but more generically
to the database in general (like table or view).
Thus the guarding of the constraint is only partial and leads inevitably
to integrity violations.
The correct way to enforce the check constraint would be to associate it
to all tables to which it refers.
And everytime a table that belongs to a constraint is modified the
constraint would be checked.
This got me also thinking that currently checking check constraint both
at column and table level is duplication
For example the following two check constraints are exactly the same.
1) Column level:
CREATE TABLE T (COL INT CHECK COL > (SELECT SUM(COL2) FROM T2) )
2) Table level
CREATE TABLE T (COL INT, CONSTRAINT C CHECK (SELECT SUM(COL2) FROM T2)
<ANY (SELECT COL FROM T))
And both of them fail to enforce the constraint when table T2 is modified.
From the second formulation we can see that the constraint depends on.
T.COL and T2.COL2 so whenever either one is modified the constraint
should be checked.
I suppose these dependencies could be retrieved something like
HashSet<DbObject> deps = HashSet<DbObject>();
expression.isEverything(ExpressionVisitor.getDependenciesVisitor(deps));
But here we confront another big issue.
Constraints are currently checked before the modification is actually done.
For example with INSERT the column constraints are checked at
line 120: table.validateConvertUpdateSequence(session, newRow);
This call takes also care of of computed columns in the new row.
Then table constraints are checked right after it at
line 121: boolean done = table.fireBeforeRow(session, null, newRow);
In this sequence the constraint would not check the state of affairs
after the modification but before.
And therefore it would not catch a constraint violation with the new row
but without it (which is pointless of course).
So I would suggest that the CHECK constraint checks should be moved
after the modification.
Lines 124-126 in Insert are
table.addRow(session, newRow);
session.log(table, UndoLogRecord.INSERT, newRow);
table.fireAfterRow(session, null, newRow, false);
This can be done simply by changing isBefore() method in CheckConstraint
to return false.
Which raises the question:
Why are other constraints before but referential constraint after?
Is it because of the same reason that I have been talking about?
And finally we come to the question of performance.
It is clear that when the constraint is defined at the column level
(example #1) then we only need to run select
SELECT SUM(COL2) FROM T2
once for the inserted row.
But if the constraint is formulated like in example 2
CREATE TABLE T (COL INT, CONSTRAINT C CHECK (SELECT SUM(COL2) FROM T2)
<ANY (SELECT COL FROM T))
then in order to run the constraint check as efficiently we would need
to optimize the CHECK statement to
(SELECT SUM(COL2) FROM T2) <ANY (SELECT COL FROM T WHERE
COL=<THE_NEW_VALUE>)
But if the insert is done to table T2 then all rows in T must be checked.
There is no way around it.
The optimization in my particular example could be done by the user by
formulating the constraint like this
CREATE TABLE T (COL INT, CONSTRAINT C CHECK (SELECT SUM(COL2) FROM T2) <
(SELECT MAX(COL) FROM T))
But there is no generic way to optimize all constraints in all cases and
sometimes enforcing a constraint just happens to be costly.
Anyway the first optimization could be done in this case but if the
constraint were
CREATE TABLE T (COL INT, CONSTRAINT C CHECK (SELECT SUM(COL2) FROM T2) <
(SELECT SUM(COL) FROM T))
there would not be any room for that kind of optimization.
One way to keep this optimization when T is updated would be to keep the
constraint associated with the column but create another generic version
of the constraint that is associated with other tables that are
mentioned in the constraint. But this would complicate the code in my
opinion and make it hard to understand.
A more generic way would be to try to optimize the query having the
changed row at hand. If the row belongs to a subquery where there are NO
AGGREGATIONS then the optimization could be inserted into the query.
Like I just showed:
(SELECT SUM(COL2) FROM T2) <ANY (SELECT COL FROM T WHERE
COL=<THE_NEW_VALUE>)
The added portion being WHERE COL=<THE_NEW_VALUE>
But here we would need to pick a column from table T that is used.
If the table has a primary key (or any key) then the first key column(s)
could be chosen.
But if it does not have any key columns then I think this optimization
becomes impossible.
In that case the constraint query can still be executed. It would just
run a bit slower.
Many details are still unclear to me but I wanted to share my ideas so
that those interested can give me advice, help and opinions.
If this idea works well then my next development idea would be to
include check constraints when turning off referential constraints in
the database. The reason being that it is very difficult to import data
to a database that has these kinds of complex constraints.
And finally I would try to improve the functionality of turning the
constraints back on.
There it would be a wonderful feature that when turning them on the DBMS
would go over all the data and check that those constraints are
respected. If not then it would return a result set that details all the
constraint violations.
The user could fix these violations and then try turning on the
constraints back on again.
This would make users more secure about the consistency of their data.
- Rami
--
You received this message because you are subscribed to the Google Groups "H2
Database" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to
[email protected].
For more options, visit this group at
http://groups.google.com/group/h2-database?hl=en.