RE: [sqlite] SQLite character comparisons

Darren Duncan Sun, 20 Jan 2008 18:55:28 -0800

At 11:19 AM -0500 1/20/08, Fowler, Jeff wrote:

To restate briefly, ANSI SQL-92 specifies that when comparing two character fields, trailing spaces should be ignored. Correct me if I'm wrong Darren, but you feel this is a bad decision, and in fact SQLite's implementation of character comparison (respecting trailing spaces) is superior to ANSI's specs.


Yes, that is indeed what I am saying.
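
For anyone who wants to see the behaviour in question, here is a minimal sketch using Python's bundled sqlite3 module. The helper name sql_equal is just something made up for this illustration, and it assumes a default build where plain text comparison uses the BINARY collation.

```python
import sqlite3

# In-memory database; nothing here touches disk.
conn = sqlite3.connect(":memory:")

def sql_equal(a, b):
    """Hypothetical helper: True if SQLite's default '=' treats a and b as equal."""
    (result,) = conn.execute("SELECT ? = ?", (a, b)).fetchone()
    return bool(result)

print(sql_equal("abc", "abc"))    # True: identical strings
print(sql_equal("abc", "abc  "))  # False: trailing spaces are significant
```

Under the SQL-92 rule described in the quoted text, the second comparison would be expected to come out equal instead.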

More broadly speaking, and this may already be familiar to some of you who remember several of my writings over the last few years, I believe that while SQL has a lot of good things going for it, it also has numerous flaws, some of which are quite severe in their consequences. I am specifically addressing the ANSI/ISO SQL standard itself with this blame, not any implementation in particular.

I make this assessment of SQL both in respect to how well SQL is able to represent the relational model of data that Codd proposed for computer databases, and in respect to how well SQL is constructed according to well-established principles of good language design.

As far as I am concerned, any quasi-implementation of SQL that addresses these flaws is something to applaud.

And at times when it seems SQLite is already doing things a better way, I am inclined to argue in support of its current status quo.

I won't address/re-address the other perceived SQL flaws in this thread, to stay on topic, but I'll further clarify my position on the space-pad thing in light of the previous paragraphs. It may even appear that I changed or reversed my position, but I don't feel that it changed.

1. The most important thing to have in regards to data types and values is a fully deterministic (and preferably simple) concept of value identity, that is, when 2 containers are considered to hold identical values or not, or should I say, when 2 appearances of values are in fact the same one value.

2. The same conceptual value can have multiple physical representations, but this distinction is meant to be abstracted away from the user, so for example if the definition of a data type says that the representations 2.0 and 2.00 are the same value, then an equality test on them should return true; that said, users should not even see the difference then; any display of either physical representation to the user should be normalized to the same thing, such as 2, so when 2 values are considered equal by the system, they look the same to the user, but it is still okay to store them differently behind the scenes.

3. It is okay in the general case for a system's conception of value identity to be different than another system's as long as the rules are clearly documented. In this respect, it is okay for either trailing spaces to be significant, or for them to be non-significant, for determining identity (and by extension, equality), as long as these rules are consistently applied everywhere that value appears. Eg, 2 given character strings Foo and Bar can't be considered identical in some contexts and non-identical in other contexts. If you want to have it both ways, you need to have 2 distinct data types which happen to look similar, eg a CharStrSpSignif data type and a CharStrSpInsig data type, and then you use values of one type in one context and separate values of the other type in other contexts.

4. In this respect, if I don't misunderstand, SQLite's text data type is the CharStrSpSignif data type, and the SQL standard has the CharStrSpInsig type instead; if you consider the 2 systems as having different data types, then this difference of behaviour is explainable. Moreover, you can have your choice of the behaviour in different systems by having both types implemented there to choose from when you want, like you can choose between text and number types now.

5. A more practical example of #2, ignoring the whole spaces thing, is in regard to Unicode codepoints vs graphemes. Even if you are using a consistent byte encoding throughout, such as just UTF-8 or UTF-16-LE, you still have to be concerned with the fact that Unicode has multiple normal forms. Depending on your normal form, such as normal form C vs normal form D, you may have different sequences of code points representing the same grapheme. An example is a single grapheme that is the combination of a plain roman letter plus a diacritical mark or accent; in NFC, that may be a single codepoint, in NFD, it might be a sequence of 2 code points. So, it is important for a character string data type to explicitly be considered either as a string of code points or of graphemes, for example. At the higher level abstraction, the 2 forms of letter+accent would be considered identical, but in the lower level abstraction, they would be non-identical. Note that afaik most high-level Unicode systems normally work in the highest abstraction level possible (whether they synchronize the normal form on storage or on compare is beside the point), as that is what users would expect; in which case, the actual codepoints in use would be considered non-significant, and be abstracted away from the user.

6. So as long as identity considerations are handled properly, it doesn't matter for satisfying the relational model of data whether trailing spaces are significant, just as it doesn't for graphemes vs codepoints abstraction. So then in this regard I consider SQLite's current approach and the SQL standard's prescription to be equally valid.

7. So my argument that trailing spaces should be considered significant comes down more to what are considered well-established principles of good language design. I would argue that if you want a simpler situation, then all the characters should be significant, and that is what most programming languages do for character string literals.

8. If one wants to argue for the merits of ignoring trailing spaces, then I would ask for what reason and why stop there? I would imagine that a valid reason to consider said spaces insignificant is if, say, the text is meant to represent some human speech, and it is more just that there are spaces between or around words at all that is significant, not how many spaces. And so, in such a situation where trailing spaces are insignificant, I would think that having varying amounts of space characters between words is also insignificant, and comparisons should treat the text as if each word is separated by exactly one space.

9. And so an argument for all characters being significant is largely an argument for keeping things simple, which I think in the general case is what people expect. For situations where people expect different, they probably expect multiple other differences in conjunction with the trailing spaces thing, such as middle or leading spaces.

10. In the interests of usability, the base behaviour should be simpler, such as SQLite's is, and special-casing strings should be built on top of that base, rather than the other way around. It's probably a lot easier or more elegant to add special cases than to remove them. An example is what drh provided with his new collation commit.
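
To make points 5, 8 and 10 a bit more concrete, here is a rough sketch of special cases layered on top of the simple all-characters-significant base, again using Python's sqlite3 module. RTRIM is one of SQLite's built-in collations in current releases; I am only assuming it corresponds to the commit mentioned above. The COLLAPSE_SPACES and UNICODE_NFC collations, and the helper sql_equal, are hypothetical names invented for this illustration and are not part of SQLite.

```python
import re
import sqlite3
import unicodedata

conn = sqlite3.connect(":memory:")

def _cmp(a, b):
    # Collation callbacks must return a negative, zero, or positive integer.
    return (a > b) - (a < b)

def collapse_spaces(a, b):
    """Hypothetical collation: ignore leading/trailing spaces and
    treat any run of spaces between words as exactly one space."""
    def norm(s):
        return re.sub(r" +", " ", s.strip(" "))
    return _cmp(norm(a), norm(b))

def unicode_nfc(a, b):
    """Hypothetical collation: compare after normalizing both sides to NFC,
    so letter+accent compares equal however it is encoded."""
    return _cmp(unicodedata.normalize("NFC", a),
                unicodedata.normalize("NFC", b))

# Register the special cases on top of the default (BINARY) behaviour.
conn.create_collation("COLLAPSE_SPACES", collapse_spaces)
conn.create_collation("UNICODE_NFC", unicode_nfc)

def sql_equal(a, b, collation="BINARY"):
    # The collation name is interpolated into the SQL text, so only pass
    # trusted, hard-coded names here.
    sql = f"SELECT ? = ? COLLATE {collation}"
    (result,) = conn.execute(sql, (a, b)).fetchone()
    return bool(result)

# Base behaviour: every character is significant.
print(sql_equal("abc", "abc  "))                      # False
# Built-in special case: trailing spaces ignored.
print(sql_equal("abc", "abc  ", "RTRIM"))             # True
# RTRIM stops at trailing spaces; inner spacing still matters.
print(sql_equal("a  b", "a b", "RTRIM"))              # False
# Hypothetical special case: amount of spacing ignored everywhere.
print(sql_equal(" a  b ", "a b", "COLLAPSE_SPACES"))  # True
# NFC (one code point) vs NFD (letter + combining accent).
print(sql_equal("\u00e9", "e\u0301"))                 # False
print(sql_equal("\u00e9", "e\u0301", "UNICODE_NFC"))  # True
```

Each collation here is an opt-in per comparison, so the default notion of equality stays the simple one and anyone who wants the looser rules has to ask for them by name.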


-- Darren Duncan

P.S. As another piece of full disclosure, I'm in the midst of writing the spec for an industrial-quality programming language, named Muldis D, which is intended to replace SQL as the de facto language of choice for relational databases. I'm also significantly involved in the design of the Perl 6 language. So I have been looking at the relevant issues quite closely and I believe I can rationalize any arguments I make in regards to how a DBMS or a programming language should behave, and moreover that such differences from the SQL standard are viable in the real world for real work.

