At 11:19 AM -0500 1/20/08, Fowler, Jeff wrote:
To restate briefly, ANSI SQL-92 specifies that when comparing two
character fields, trailing spaces should be ignored. Correct me if
I'm wrong Darren, but you feel this is a bad decision, and in fact
SQLite's implementation of character comparison (respecting trailing
spaces) is superior to ANSI's specs.
Yes, that is indeed what I am saying.
More broadly speaking, and this may already be familiar to some of
you who remember several of my writings over the last few years, I
believe that while SQL has a lot of good things going for it, it also
has numerous flaws, some of which are quite severe in their
consequences. I am specifically addressing the ANSI/ISO SQL standard
itself with this blame, not any implementation in particular.
I make this assessment of SQL both in respect to how much SQL is able
to represent the relational model of data that Codd proposed to be
used for computer databases, and in respect to how much SQL is
constructed according to well-established principles of good language
design.
As far as I am concerned, any quasi-implementation of SQL that
addresses these flaws is something to applaud.
And at times that it seems SQLite is already doing things a better
way, I am inclined to argue in support of its current status quo.
I won't address/re-address the other perceived SQL flaws in this
thread, to stay on topic, but I'll further clarify my position on the
space-pad thing in light of the previous paragraphs. It may even
appear that I changed or reversed my position, but I don't feel that
it changed.
1. The most important thing to have in regards to data types and
values is to have a fully deterministic (and preferrably simple)
concept of value identity, that is, when 2 containers are considered
to hold identical values or not, or should I say, when 2 appearances
of values are in fact the same one value.
2. The same conceptual value can have multiple physical
representations, but this distinction is meant to be abstracted away
from the user, so for example if the definition of a data type says
that the representations 2.0 and 2.00 are the same value, then an
equality test on them should return true; that said, users should not
even see the difference then; any display of either physical
representation to the user should be normalized to the same thing,
such as 2, so when 2 values are considered equal by the system, they
look the same to the user, but it is still okay to store them
differently behind the scenes.
3. It is okay in the general case for a system's conception of value
identity to be different than another system's as long as the rules
are clearly documented. In this respect, it is okay for either
trailing spaces to be significant, or for them to be non-significant,
for determining identity (and by extension, equality), as long as
these rules are consistently applied everywhere that value appears.
Eg, 2 given character strings Foo and Bar can't be considered
identical in some contexts and non-identical in other contexts. If
you want to have it both ways, you need to have 2 distinct data types
which happen to look similar, eg a CharStrSpSignif data type and a
CharStrSpInsig data type, and then you use values of one type in one
context and separate values of the other type in other contexts.
4. In this respect, if I don't misunderstand, SQLite's text data
type is the CharStrSpSignif data type, and the SQL standard has the
CharStrSpInsig type instead; if you consider the 2 systems as having
different data types, then this difference of behaviour is
explainable. Moreover, you can have your choice of the behaviour in
different systems by having both types implemented there to choose
from when you want, like you can choose between text and number types
now.
5. A more practical example of #2, ignoring the whole spaces thing,
is in regard to Unicode codepoints vs graphemes. Even if you are
using a consistent byte encoding throughout, such as just UTF-8 or
UTF-16-LE, you still have to be concerned with the fact that Unicode
has multiple normal forms. Depending on your normal form, such as
normal form C vs normal form D, you may have different sequences of
code points representing the same grapheme. An example of a single
grapheme being the combination of a plain roman letter plus a
diacritical mark or accent; in NFC, that may be a single codepoint,
in NFD, it might be a sequence of 2 code points. So, it is important
for a character string data type to explicitly be considered either
as a string of code points or of graphemes, for example. At the
higher level abstraction, the 2 forms of letter+accent would be
considered identical, but in the lower level abstraction, they would
be non-identical. Note that afaik most high-level Unicode systems
normally work in the highest abstraction level possible (whether they
synchronize the normal form on storage or on compare is beside the
point), as that is what users would expect; in which case, the actual
codepoints in use would be considered non-significant, and be
abstracted away from the user.
6. So as long as identity considerations are handled properly, it
doesn't matter for satisfying the relational model of data as to
whether trailing spaces are significant, just as it doesn't for
graphemes vs codepoints abstraction. So then in this regard I
consider SQLite's current approach and the SQL standard's
proscription to be equally valid.
7. So my argument about that trailing spaces should be considered
significant comes more down to what is considered well established
principles of good language design. I would argue that if you want a
simpler situation, that all the characters are significant, and that
is what most programming languages do for character string literals.
8. If one wants to argue for the merits of ignoring trailing spaces,
then I would ask for what reason and why stop there? I would imagine
that a valid reason to consider said spaces insignificant is if, say,
the text is meant to represent some human speech, and it is more just
that there are spaces between or around words at all that is
significant, not how many spaces. And so, in such a situation where
trailing spaces are insignificant, I would think that having varying
amounts of space characters between words is also insignificant, and
comparisons should treat as if each word is separated by exactly one
space.
9. And so an argument for all characters being significant is
largely an argument in keeping things simple, which I think in the
general case is what people expect. For situations where people
expect different, they probably expect multiple other differences in
conjunction with the trailing spaces thing, such as middle or leading
spaces.
10. In the interests of useability, the base behaviour should be
simpler, such as SQLite is, and special-casing strings should be
built on top of that base, rather than the other way around. It's
probably a lot easier or more elegant to add special cases than to
remove them. Also such as drh provided with his new collation commit.
-- Darren Duncan
P.S. As another piece of full-disclosure, I'm in the midst of
writing the spec for an industrial-quality programming language,
named Muldis D, which is intended to replace SQL as the defacto
language of choice for relational databases. I'm also significantly
involved in the design of the Perl 6 language. So I have been
looking at the relevant issues quite closely and I believe I can
rationalize any arguments I make in regards to how a DBMS or a
programming language should behave, and moreover that such
differences from the SQL standard are viable in the real world for
real work.
-----------------------------------------------------------------------------
To unsubscribe, send email to [EMAIL PROTECTED]
-----------------------------------------------------------------------------