Mike Matrigali wrote:
Daniel John Debrunner wrote:
Mike Matrigali wrote:
Rick Hillegas wrote:
Thanks, Mike. This overhead seems pretty small to me. It's hard for
me to predict whether this is useful generality or over-design.
In the SQL standard, collations can be declared per column. That
affects index descriptors. In addition, via CASTs, collations can
be declared per sortable expression in an ORDER BY clause. That
affects the sorter. I'm not the person scratching this initial
itch. I just want to register my instinct to design-in the
generality up front. I think this has two advantages:
1) It will remove an upgrade issue later on when someone wants to
implement more of the SQL collation support.
2) It generally lowers the barrier to implementing more of the
standard.
Regards,
-Rick
I am just not sure how comfortable I feel forcing an upgrade issue on a
developer for a particular feature that is not their itch. Mamta is
trying to solve single collation database problem, not full SQL
collation support.
There's a number of factors that come in, one is the long term
maintainability of the code. I think that trumps any single
developer's itch. The developer can work with the community in coming
up with a solution that keeps a good balance between what the
community see as maintainability and scratching their itch.
I'm actually trying to save the contributor (Mamta) work here, I think
changing all the locations that generate characters to have the
correct "new-character-type" is a huge amount of work and subject to
errors (just from the amount of changes and interesting situations).
E.g. in some situations a literal will be a CHAR (sorting by
ucs_basic) and others a CHAR (sorting by locale). That decision may
not be able to be made until very late in the bind time, and may not
possibly even matter even thought code would have to pick one. Only
caring about this when collation is involved may make it easier.
I obviously don't know "all the places", so it is not clear to me why
some of the places don't have to change. It is not clear to me why one
does not in the new proposal have to change all the locations that
generate characters to have the correct "new-collation-type". I think
this is because I dont understand the runtime usages. Am I at least
right about the following locations where we persist the columns. If
we get the right info into them when we persist them, then we can get
the right info into them when we read them back.
Let me see if I can explain the general compile time situations I'm
thinking about.
Assume a SQL expression where all the types are CHAR.
f('fred', col1, col2) = col3
Now currently the bind code is going to resolve types in this order:
B1) 'fred' - CHAR
B2) col1 - CHAR
B3) col2 - CHAR
B4) f('fred', col1, col2) - CHAR
B5) col3 - CHAR
B6) result - BOOLEAN
once pass, got the right result! :-)
So with the proposed dual type system we have the two internal character
types for CHAR:
CHAR(locale) - CHAR with collation for user columns
CHAR(ucs_basic) - CHAR collation for UCS_BASIC for system columns
Now in the proposed dual type system, the same bind ordering will occur.
Let's assume all col1 & col2 are user columns.
So the bind will result in
B1) 'fred' - unknown
B2) col1 - CHAR(locale)
B3) col2 - CHAR(locale)
B4) f('fred', col1, col2) - unknown
B5) col3 - CHAR(locale)
B6) result - unknown - don't know if types can be compared.
So come the end of bind time we haven't resolved the types,
so a second bind phase would be needed, which doesn't exist at the moment.
So what would that second phase do?
Bii1) 'fred' still unknown, no good reason to pick either type. Could
base it on the other arguments but what if no other character arguments
or other character arguments are a mix of CHAR(locale) and CHAR(ucs_basic)?
Bii2) f('fred', col1, col2) - unknown, don't know how to look up the
function without a type
Bii3) result - unknown - don't know if types can be compared.
Whoops, no progress, fail query.
Now one could introduce a partial type, or a third CHAR type -
CHAR(unknown), which might solve the problem. First phase would be:
B1) 'fred' - CHAR(unknown)
B2) col1 - CHAR(locale)
B3) col2 - CHAR(locale)
B4) f('fred', col1, col2) - CHAR(unknown)
B5) col3 - CHAR(locale)
B6) result - BOOLEAN
Hmmmm, so I've got the right result but I've been left with a series of
CHAR(unknown) in the tree, four options:
1) add extra bind phases to resolve them, but I think this will fail
for the same reasons where we didn't have CHAR(unknown).
2) make CHAR(unknown) a first class internal type, fully supported at
runtime and compile time.
3) resolve them in a second bind phase to CHAR(ucs_basic), doesn't
work because can't compare across collations
4) resolve them in a smarter second bind phase where the unknown
types are converted to the matching type in a collation operator and
CHAR(ucs_basic) elsewhere.
So a possible solution, but note a third character type has been added
for CHAR and either needs to fully implement an internal type to make
sure it works in the compile and execution system or I need a second
bind phase.
To me this seems like it's heading off in the direction of a hack, a
third character internal type for CHAR? Hacks lead to bugs, bugs lead to
the dark side :-)
-------------------------------------
So what about having collation as an attribute of a character type, then
we have:
B1) 'fred' - CHAR collation=unknown
B2) col1 - CHAR collation=locale
B3) col2 - CHAR collation=locale
B4) f('fred', col1, col2) - CHAR collation=unknown
B5) col3 - CHAR collation=locale
B6) result - CHAR collation=locale
First pass got the right result, just like the three type CHAR(unknown)
case. :-) What about those unknown collations, no problem, I know that
if a collation was unknown then that information is not needed, and
since they are just the standard CHAR type I already know that works at
compile and execute time.
[in both the compare unknown collation to known collation the compiler
would generate code to execute the comparison using the known collation.
This may happen automatically using the precedence system or compiler
inserts some promote code, which possibly goes back to the some of the
methods I proposed earlier. ]
-----------------------------------------------
Ok, double check, let's make the expression
f('fred', col1, syscol2) = syscol3
Of course the current code is going to bind everything to CHAR.
let's try the multi-type system with CHAR(unknown)
B1) 'fred' - CHAR(unknown)
B2) col1 - CHAR(locale)
B3) syscol2- CHAR(ucs_basic)
B4) f('fred', col1, syscol2) - CHAR(unknown)
B5) syscol3 - CHAR(ucs_basic)
B6) result - BOOLEAN
and similar for the attribute case
B1) 'fred' - CHAR collation=unknown
B2) col1 - CHAR collation=locale
B3) syscol2- CHAR collation=ucs_basic
B4) f('fred', col1, syscol2) - CHAR collation=unknown
B5) syscol3 - CHAR collation=ucs_basic
B6) result - BOOLEAN
---------------------------------------------------
and in case you are wondering why a literal such as 'fred' or a function
return doesn't just resolve to CHAR(locale), consider these examples:
'fred' = syscol1 - would fail, can't compare across collations, but
database meta data queries depend on this behaviour
'fred' = col1
---------------------------------------------------
and just for kicks
'fred' = 'barney'
Both are unknown collation types, in this case I think the result of
collation would the default user type, collation=locale.
Sorry for the long e-mail.
Hope this is clear.
Dan.