At 11:19 AM -0500 1/20/08, Fowler, Jeff wrote:
To restate briefly, ANSI SQL-92 specifies that when comparing two character fields, trailing spaces should be ignored. Correct me if I'm wrong Darren, but you feel this is a bad decision, and in fact SQLite's implementation of character comparison (respecting trailing spaces) is superior to ANSI's specs.

Yes, that is indeed what I am saying.

More broadly speaking, and this may already be familiar to some of you who remember several of my writings over the last few years, I believe that while SQL has a lot of good things going for it, it also has numerous flaws, some of which are quite severe in their consequences. I am specifically addressing the ANSI/ISO SQL standard itself with this blame, not any implementation in particular.

I make this assessment of SQL both in respect to how much SQL is able to represent the relational model of data that Codd proposed to be used for computer databases, and in respect to how much SQL is constructed according to well-established principles of good language design.

As far as I am concerned, any quasi-implementation of SQL that addresses these flaws is something to applaud.

And at times that it seems SQLite is already doing things a better way, I am inclined to argue in support of its current status quo.

I won't address/re-address the other perceived SQL flaws in this thread, to stay on topic, but I'll further clarify my position on the space-pad thing in light of the previous paragraphs. It may even appear that I changed or reversed my position, but I don't feel that it changed.

1. The most important thing to have in regards to data types and values is to have a fully deterministic (and preferrably simple) concept of value identity, that is, when 2 containers are considered to hold identical values or not, or should I say, when 2 appearances of values are in fact the same one value.

2. The same conceptual value can have multiple physical representations, but this distinction is meant to be abstracted away from the user, so for example if the definition of a data type says that the representations 2.0 and 2.00 are the same value, then an equality test on them should return true; that said, users should not even see the difference then; any display of either physical representation to the user should be normalized to the same thing, such as 2, so when 2 values are considered equal by the system, they look the same to the user, but it is still okay to store them differently behind the scenes.

3. It is okay in the general case for a system's conception of value identity to be different than another system's as long as the rules are clearly documented. In this respect, it is okay for either trailing spaces to be significant, or for them to be non-significant, for determining identity (and by extension, equality), as long as these rules are consistently applied everywhere that value appears. Eg, 2 given character strings Foo and Bar can't be considered identical in some contexts and non-identical in other contexts. If you want to have it both ways, you need to have 2 distinct data types which happen to look similar, eg a CharStrSpSignif data type and a CharStrSpInsig data type, and then you use values of one type in one context and separate values of the other type in other contexts.

4. In this respect, if I don't misunderstand, SQLite's text data type is the CharStrSpSignif data type, and the SQL standard has the CharStrSpInsig type instead; if you consider the 2 systems as having different data types, then this difference of behaviour is explainable. Moreover, you can have your choice of the behaviour in different systems by having both types implemented there to choose from when you want, like you can choose between text and number types now.

5. A more practical example of #2, ignoring the whole spaces thing, is in regard to Unicode codepoints vs graphemes. Even if you are using a consistent byte encoding throughout, such as just UTF-8 or UTF-16-LE, you still have to be concerned with the fact that Unicode has multiple normal forms. Depending on your normal form, such as normal form C vs normal form D, you may have different sequences of code points representing the same grapheme. An example of a single grapheme being the combination of a plain roman letter plus a diacritical mark or accent; in NFC, that may be a single codepoint, in NFD, it might be a sequence of 2 code points. So, it is important for a character string data type to explicitly be considered either as a string of code points or of graphemes, for example. At the higher level abstraction, the 2 forms of letter+accent would be considered identical, but in the lower level abstraction, they would be non-identical. Note that afaik most high-level Unicode systems normally work in the highest abstraction level possible (whether they synchronize the normal form on storage or on compare is beside the point), as that is what users would expect; in which case, the actual codepoints in use would be considered non-significant, and be abstracted away from the user.

6. So as long as identity considerations are handled properly, it doesn't matter for satisfying the relational model of data as to whether trailing spaces are significant, just as it doesn't for graphemes vs codepoints abstraction. So then in this regard I consider SQLite's current approach and the SQL standard's proscription to be equally valid.

7. So my argument about that trailing spaces should be considered significant comes more down to what is considered well established principles of good language design. I would argue that if you want a simpler situation, that all the characters are significant, and that is what most programming languages do for character string literals.

8. If one wants to argue for the merits of ignoring trailing spaces, then I would ask for what reason and why stop there? I would imagine that a valid reason to consider said spaces insignificant is if, say, the text is meant to represent some human speech, and it is more just that there are spaces between or around words at all that is significant, not how many spaces. And so, in such a situation where trailing spaces are insignificant, I would think that having varying amounts of space characters between words is also insignificant, and comparisons should treat as if each word is separated by exactly one space.

9. And so an argument for all characters being significant is largely an argument in keeping things simple, which I think in the general case is what people expect. For situations where people expect different, they probably expect multiple other differences in conjunction with the trailing spaces thing, such as middle or leading spaces.

10. In the interests of useability, the base behaviour should be simpler, such as SQLite is, and special-casing strings should be built on top of that base, rather than the other way around. It's probably a lot easier or more elegant to add special cases than to remove them. Also such as drh provided with his new collation commit.

-- Darren Duncan

P.S. As another piece of full-disclosure, I'm in the midst of writing the spec for an industrial-quality programming language, named Muldis D, which is intended to replace SQL as the defacto language of choice for relational databases. I'm also significantly involved in the design of the Perl 6 language. So I have been looking at the relevant issues quite closely and I believe I can rationalize any arguments I make in regards to how a DBMS or a programming language should behave, and moreover that such differences from the SQL standard are viable in the real world for real work.

-----------------------------------------------------------------------------
To unsubscribe, send email to [EMAIL PROTECTED]
-----------------------------------------------------------------------------

Reply via email to