Re: Collation implementation WAS Re: Should COLLATION attribute related code go in BasicDatabase?

Daniel John Debrunner Thu, 15 Mar 2007 16:35:23 -0800

Mike Matrigali wrote:

Daniel John Debrunner wrote:
Mike Matrigali wrote:
Rick Hillegas wrote:
Thanks, Mike. This overhead seems pretty small to me. It's hard for me to predict whether this is useful generality or over-design.
In the SQL standard, collations can be declared per column. That affects index descriptors. In addition, via CASTs, collations can be declared per sortable expression in an ORDER BY clause. That affects the sorter. I'm not the person scratching this initial itch. I just want to register my instinct to design-in the generality up front. I think this has two advantages:
1) It will remove an upgrade issue later on when someone wants to implement more of the SQL collation support.
2) It generally lowers the barrier to implementing more of the standard.
Regards,
-Rick
I am just not sure how comfortable I feel forcing an upgrade issue on a
developer for a particular feature that is not their itch. Mamta is trying to solve single collation database problem, not full SQL collation support.
There's a number of factors that come in, one is the long term maintainability of the code. I think that trumps any single developer's itch. The developer can work with the community in coming up with a solution that keeps a good balance between what the community sees as maintainability and scratching their itch.
I'm actually trying to save the contributor (Mamta) work here, I think changing all the locations that generate characters to have the correct "new-character-type" is a huge amount of work and subject to errors (just from the amount of changes and interesting situations). E.g. in some situations a literal will be a CHAR (sorting by ucs_basic) and others a CHAR (sorting by locale). That decision may not be able to be made until very late in the bind time, and may not even matter even though code would have to pick one. Only caring about this when collation is involved may make it easier.
I obviously don't know "all the places", so it is not clear to me why some of the places don't have to change. It is not clear to me why one does not in the new proposal have to change all the locations that generate characters to have the correct "new-collation-type". I think
this is because I don't understand the runtime usages.  Am I at least
right about the following locations where we persist the columns?  If
we get the right info into them when we persist them, then we can get
the right info into them when we read them back.

Let me see if I can explain the general compile time situations I'm thinking about.


Assume a SQL expression where all the types are CHAR.

   f('fred', col1, col2) = col3

Now currently the bind code is going to resolve types in this order:

B1) 'fred' - CHAR
B2) col1 - CHAR
B3) col2 - CHAR
B4) f('fred', col1, col2) - CHAR
B5) col3 - CHAR
B6) result - BOOLEAN

One pass, got the right result! :-)

So with the proposed dual type system we have the two internal character types for CHAR:


CHAR(locale) - CHAR with locale based collation, for user columns
CHAR(ucs_basic) - CHAR with UCS_BASIC collation, for system columns

Now in the proposed dual type system, the same bind ordering will occur.

Let's assume col1, col2 & col3 are all user columns.

So the bind will result in

B1) 'fred' - unknown
B2) col1 - CHAR(locale)
B3) col2 - CHAR(locale)
B4) f('fred', col1, col2) - unknown
B5) col3 - CHAR(locale)
B6) result - unknown - don't know if types can be compared.

So come the end of bind time we haven't resolved the types,
so a second bind phase would be needed, which doesn't exist at the moment.

So what would that second phase do?

Bii1) 'fred' still unknown, no good reason to pick either type. Could base it on the other arguments, but what if there are no other character arguments, or the other character arguments are a mix of CHAR(locale) and CHAR(ucs_basic)?

Bii2) f('fred', col1, col2) - unknown, don't know how to look up the function without a type

Bii3) result - unknown  - don't know if types can be compared.

Whoops, no progress, fail query.
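
To make the dead end concrete, here is a rough Java sketch. It is only an illustration: the class and method names (DualTypeBindSketch, canLookUpFunction, canCompare) are made up for this mail and are not Derby code, and an empty Optional simply stands for a type the binder could not determine.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; none of these names are Derby APIs.
final class DualTypeBindSketch {
    // The two internal CHAR types of the proposed dual type system.
    enum CharType { CHAR_LOCALE, CHAR_UCS_BASIC }

    // A function can only be looked up when every argument type is known.
    static boolean canLookUpFunction(List<Optional<CharType>> argTypes) {
        for (Optional<CharType> type : argTypes) {
            if (!type.isPresent()) {
                return false;
            }
        }
        return true;
    }

    // Two operands can only be compared when both types are known and equal.
    static boolean canCompare(Optional<CharType> left, Optional<CharType> right) {
        return left.isPresent() && right.isPresent() && left.get() == right.get();
    }

    public static void main(String[] args) {
        // The literal 'fred': no good reason to pick either type.
        Optional<CharType> fred = Optional.empty();
        Optional<CharType> userColumn = Optional.of(CharType.CHAR_LOCALE);

        boolean found = canLookUpFunction(Arrays.asList(fred, userColumn, userColumn));
        // The lookup failed, so the function's own type stays unknown.
        Optional<CharType> functionResult = Optional.empty();

        System.out.println("function lookup possible: " + found);
        System.out.println("comparison possible: " + canCompare(functionResult, userColumn));
    }
}
```

Running the same checks a second time changes nothing, because no new information arrives between the passes.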

Now one could introduce a partial type, or a third CHAR type - CHAR(unknown), which might solve the problem. First phase would be:


B1) 'fred' - CHAR(unknown)
B2) col1 - CHAR(locale)
B3) col2 - CHAR(locale)
B4) f('fred', col1, col2) - CHAR(unknown)
B5) col3 - CHAR(locale)
B6) result - BOOLEAN

Hmmmm, so I've got the right result but I've been left with a series of CHAR(unknown) in the tree. Four options:

1) add extra bind phases to resolve them, but I think this will fail for the same reasons as when we didn't have CHAR(unknown).
2) make CHAR(unknown) a first class internal type, fully supported at runtime and compile time.
3) resolve them in a second bind phase to CHAR(ucs_basic), doesn't work because can't compare across collations.
4) resolve them in a smarter second bind phase where the unknown types are converted to the matching type in a collation operator and CHAR(ucs_basic) elsewhere.

So a possible solution, but note a third character type has been added for CHAR, and I either need to fully implement an internal type to make sure it works in the compile and execution system or I need a second bind phase.

To me this seems like it's heading off in the direction of a hack, a third character internal type for CHAR? Hacks lead to bugs, bugs lead to the dark side :-)


-------------------------------------

So what about having collation as an attribute of a character type, then we have:


B1) 'fred' - CHAR collation=unknown
B2) col1 - CHAR collation=locale
B3) col2 - CHAR collation=locale
B4) f('fred', col1, col2) - CHAR collation=unknown
B5) col3 - CHAR collation=locale
B6) result - BOOLEAN (comparison uses collation=locale)

First pass got the right result, just like the three type CHAR(unknown) case. :-) What about those unknown collations? No problem, I know that if a collation was unknown then that information is not needed, and since they are just the standard CHAR type I already know that works at compile and execute time.

[In both cases, to compare an unknown collation to a known collation the compiler would generate code to execute the comparison using the known collation. This may happen automatically using the precedence system, or the compiler inserts some promote code, which possibly goes back to some of the methods I proposed earlier.]
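
As a rough sketch of that comparison rule in Java: all the names here (CollationAttributeSketch, Collation, comparisonCollation) are invented for this mail and are not existing Derby classes. Treating two unknown collations as the default user collation is an assumption, taken from the 'fred' = 'barney' case at the end of this mail. An empty result means the operands cannot be compared.

```java
import java.util.Optional;

// Hypothetical illustration only; none of these names are Derby APIs.
final class CollationAttributeSketch {
    // Collation carried as an attribute of the single CHAR type.
    enum Collation { UNKNOWN, LOCALE, UCS_BASIC }

    // Returns the collation a comparison would run under,
    // or empty when the two operands cannot be compared.
    static Optional<Collation> comparisonCollation(Collation left, Collation right) {
        if (left == Collation.UNKNOWN && right == Collation.UNKNOWN) {
            // Assumption: two unknowns fall back to the default user collation.
            return Optional.of(Collation.LOCALE);
        }
        if (left == Collation.UNKNOWN) {
            return Optional.of(right); // the known side decides
        }
        if (right == Collation.UNKNOWN) {
            return Optional.of(left); // the known side decides
        }
        // Two known collations must match; comparing across collations fails.
        return left == right ? Optional.of(left) : Optional.empty();
    }

    public static void main(String[] args) {
        // f('fred', col1, col2) = col3: unknown compared with locale.
        System.out.println(comparisonCollation(Collation.UNKNOWN, Collation.LOCALE));
        // A user column compared with a system column.
        System.out.println(comparisonCollation(Collation.LOCALE, Collation.UCS_BASIC));
    }
}
```

The point of the sketch is that UNKNOWN never needs its own runtime support: it is either absorbed by the known side or replaced by a default before any comparison code is generated.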


-----------------------------------------------
Ok, double check, let's make the expression

   f('fred', col1, syscol2) = syscol3

Of course the current code is going to bind everything to CHAR.

let's try the multi-type system with CHAR(unknown)

B1) 'fred' - CHAR(unknown)
B2) col1 - CHAR(locale)
B3) syscol2 - CHAR(ucs_basic)
B4) f('fred', col1, syscol2) - CHAR(unknown)
B5) syscol3 - CHAR(ucs_basic)
B6) result - BOOLEAN

and similar for the attribute case

B1) 'fred' - CHAR collation=unknown
B2) col1 - CHAR collation=locale
B3) syscol2 - CHAR collation=ucs_basic
B4) f('fred', col1, syscol2) - CHAR collation=unknown
B5) syscol3 - CHAR collation=ucs_basic
B6) result - BOOLEAN
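
The attribute case above can be replayed with a small Java sketch of a single bottom-up bind pass. Everything in it is hypothetical (SinglePassBindSketch, Node, and the helper methods are invented for this mail, not Derby classes); it only assumes that literals and function results start with an unknown collation, user columns get the locale collation, and system columns get ucs_basic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; none of these names are Derby APIs.
final class SinglePassBindSketch {
    enum Collation {
        UNKNOWN("unknown"), LOCALE("locale"), UCS_BASIC("ucs_basic");

        final String label;

        Collation(String label) { this.label = label; }
    }

    // One node of the expression tree: a literal, a column or a function call.
    static final class Node {
        final String text;
        final Collation collation;
        final List<Node> children;

        Node(String text, Collation collation, List<Node> children) {
            this.text = text;
            this.collation = collation;
            this.children = children;
        }
    }

    static Node literal(String value) {
        return new Node("'" + value + "'", Collation.UNKNOWN, Collections.<Node>emptyList());
    }

    static Node userColumn(String name) {
        return new Node(name, Collation.LOCALE, Collections.<Node>emptyList());
    }

    static Node systemColumn(String name) {
        return new Node(name, Collation.UCS_BASIC, Collections.<Node>emptyList());
    }

    static Node function(String name, Node... args) {
        List<String> parts = new ArrayList<>();
        for (Node arg : args) {
            parts.add(arg.text);
        }
        // A function result starts out with an unknown collation.
        return new Node(name + "(" + String.join(", ", parts) + ")",
                Collation.UNKNOWN, Arrays.asList(args));
    }

    // Binds "left = right" in one pass and records each step in bind order.
    static List<String> bindComparison(Node left, Node right) {
        List<String> steps = new ArrayList<>();
        bind(left, steps);
        bind(right, steps);
        steps.add("result - BOOLEAN");
        return steps;
    }

    // Children are bound before their parent, as in the B1..B5 ordering.
    private static void bind(Node node, List<String> steps) {
        for (Node child : node.children) {
            bind(child, steps);
        }
        steps.add(node.text + " - CHAR collation=" + node.collation.label);
    }

    public static void main(String[] args) {
        Node left = function("f", literal("fred"), userColumn("col1"), systemColumn("syscol2"));
        for (String step : bindComparison(left, systemColumn("syscol3"))) {
            System.out.println(step);
        }
    }
}
```

Every node gets a complete type in the one pass; the unknown collations are just attribute values on an ordinary CHAR, so nothing is left to resolve afterwards.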

---------------------------------------------------

and in case you are wondering why a literal such as 'fred' or a function return doesn't just resolve to CHAR(locale), consider these examples:

'fred' = syscol1 - would fail, can't compare across collations, but database meta data queries depend on this behaviour


  'fred' = col1

---------------------------------------------------

and just for kicks

'fred' = 'barney'

Both are unknown collation types, in this case I think the result of collation would be the default user type, collation=locale.


Sorry for the long e-mail.

Hope this is clear.
Dan.
