All, and particularly Dr. Hipp,
Lately, my own database project in Perl, named Rosetta, has evolved
to officially be an implementation of Chris Date's and Hugh Darwen's
proposal for relational databases called "The Third Manifesto", which
is talked about at http://www.thethirdmanifesto.com/ and in various
books and papers of theirs.
Rosetta has its own API which defines a "D" relational database
language, as apposed to a SQL language, and Rosetta's back-ends to
implement this are interchangeable. I am making a pure Perl
reference implementation (called Rosetta::Engine::Example) which is
coded simply for ease of understanding but that is slow.
Separately, I or third parties would be making alternate back-ends
that either are self-implemented and better performing, or that
constitute wrappers over existing database products, usually SQL
based, since those are fairly mature and plentiful. SQLite is one of
the first such back-ends to be used.
Now, I would like to propose, and if necessary I will contribute
significant targeted funding (when I have the money) to pay Dr. Hipp
and/or other developers, some significant feature changes (as a fork
if necessary) to SQLite such that it directly implements and gains
the efficiencies of The Third Manifesto. This includes both the
addition of and the removal of current features, and certain
behaviours would change. Hopefully all for the better.
As a result of these changes, not only would SQLite better serve as a
back-end of Rosetta, but non-Rosetta users of SQLite would get the
most critical of the same benefits from it directly.
I anticipate that the changes would mainly affect the upper layers,
which convert user commands into virtual machine code, but that the
virtual machine and b-tree and OS layers would remain more or less
unchanged (this depends, of course, on a few details). Possibly, we
would add a new command language.
I am hoping that, to keep longer term maintenance easier, these
changes can be implemented in the trunk and activated using either
run time pragmas or compile time options or both. But if they would
require a fork, then the forked product would have to be named
something else that doesn't have 'SQL' in its name, since SQL does
not satisfy The Third Manifesto. Maybe 'TTMLite' or something that
sounds better.
Here are some of the changes that I propose the pragma or compile
time option or fork would have; they all refer to what the user sees,
not to implementation details that should be hidden:
1. Add a distinct logical BOOLEAN data type. It is the data type of
output from logical expressions like comparisons, and the input to
'and', 'or', etc.
2. Have strong and strict data typing for both variables and values.
2.1 Table columns are always declared to be of a specific type (eg:
BOOLEAN, INTEGER, REAL, TEXT, BLOB) and nothing but values of the
same type can be stored in them; attempting to do otherwise would
fail with an exception.
2.2 The plain equality test is supported for all data types.
2.3 All operators/functions have strongly typed parameters and
return values, and invoking them with arguments that aren't of the
right type will fail with an exception. The equality test likewise
can only compare operands of the same type.
2.4 There is no implicit type conversion; data types must be
explicitly converted from one type to another.
2.5 INTEGER and REAL data types have separate sets of operators,
which do the expected thing with their types. For example, each has
a separate division operator whose input and output are all of that
same type. No worrying about when to round or not.
2.6 SQLite may already be this way, but: All characters in a string
are significant, including whitespace, so 'a' and 'a ' are always
unequal.
3. There is no such thing as a NULL.
3.1 All logic is 2VL (true, false) not 3VL (true, false, unknown).
3.2 Every variable of a particular type always contains a value that
is valid for that type, so logic for dealing with it is simpler.
Likewise, every with every literal value.
3.3 The code to implement operators is a lot simpler.
3.4 Missing data can be either represented with the data type's
empty value, or a table column that may possibly be unknown can be
split into a separate related table, that only has records when the
value is known.
3.5 All variables default to a reasonable valid value for their type
if not explicitly set, such as the number zero or the empty string.
4. There is no significant hidden data. A row id can only be an
explicitly declared table column. The implementation of a table can
use hidden row ids, but the user wouldn't see them.
5. No duplicate rows in tables or queries are allowed.
5.1 In SQL terms, every table has an implicit unique key constraint
over all of its columns. This is ignored if there are any actual
explicit keys, whether primary or otherwise. In TTM terms, it is
impossible by definition to have duplicate rows.
5.2 The results of all stages of queries do not contain duplicate
rows. In SQL terms, every query or subquery has an implicit
'distinct' or 'group by all' clause on it. No joins produce
duplicates. No unions etc do either.
5.3 By doing this and #3, all queries that look like they should
return the same results actually do, whereas in SQL they may return
different results in the presence of duplicates or nulls. Queries
can also be simpler.
6. Columns in tables and views and query results have no ordinal
value; they all have names and are referred to using only those
names. Moreover, every column must have a different name from every
other column.
7. Rows in tables and views and query results have no ordinal value;
they are referenced by relational expressions that match on the
values of columns, like in a SQL where-clause.
7.1 An order-by or limit clause only makes sense in an outer-most
query, right when results are being returned from the database to the
application, where it then specifies the order to return otherwise
order-less rows.
In doing all of the above, SQLite should actually be simpler to
implement, and it will be easier to use, with more predictable
results and fewer bugs.
This next one can be implemented separately from all the other suggestions:
8. Add some standard relational logic operators that can be combined
and nested to get all the power of selects and more, with less
effort, such as any of the following you don't already have:
restrict, project, join, product, union, intersection, difference,
divide, rename.
8.1 The simplest join syntax, such as an unqualified comma-delimited
list, would perform a natural join by default. Or we could more or
less just have natural joins (and cartesian products, 'product') as
the only kind of join.
8.2 Using these instead of 'select' should allow for easier
implementation and optimization; for one thing, the expressions are
more associative or commutative.
This next one can be implemented separately from all the other suggestions:
9. Support nested/child transactions, such as a 'begin transaction'
inside another one, which can make things a lot easier for
applications; they have to worry less about whether a transaction
already exists before starting another one. These are functionally
sort of like save-points in SQL, in that even if an inner transaction
commits, it is still thrown away if the outer transaction rolls back.
To implement this best, you would probably need multiple (cascading?)
journal files, one per transaction level.
Following are also features of The Third Manifesto, but can possibly
be left out of SQLite in accordance with its Lite nature:
1. All views are updateable like they were tables. From the user's
point of view, tables and views are the same sort of thing in how
they can be used.
2. Tables can be assigned to directly like they were variables, and
insert/update/delete is actually a short-hand for this. Eg, an
insert is equivalent to an assignment to a table of the table's old
value unioned with the rows being inserted. Supporting this allows
users to define arbitrarily flexible updating operations, such as
"replace or add" and such.
3. The system catalog tables can be updated directly using data
definition language, which results in the schema being updated. Eg,
you can use insert statements to create a table rather than a create
statement.
4. Support definition and use of custom data types.
5. It should not be necessary to explicitly declare indexes to help
with speed.
6. Generally speaking, users should not have to know about
implementation details, but rather just express what their data
actually means.
Okay, that's about all for this initial proposal email.
Ultimately, I believe that the core of my proposal involves
simplifying SQLite, making it leaner and meaner, and also reduces
possible or actual bugs or difficulty in understanding.
At the very least, I hope that the trunk would have the pragma or
compile option that essentially strips out current features like
nulls and other ambiguity, so essentially we have a restricted or
simplified SQL.
I also bring this up because I would expect that SQLite should be
able to perform faster when it doesn't handle nulls or duplicates or
weak data types than if it does. The conceptual logic is simpler
when we don't have those, and the implementation code should also be
simpler, and perform faster, since there are fewer possibilities to
check at logical decision points. And it should be easier to
optimize queries.
So even if no incompatible changes are made, I would hope that it is
possible to optimize for the simplest case.
-- Darren Duncan