Mike Matrigali wrote:


Daniel John Debrunner wrote:
Mike Matrigali wrote:



Rick Hillegas wrote:


Thanks, Mike. This overhead seems pretty small to me. It's hard for me to predict whether this is useful generality or over-design.

In the SQL standard, collations can be declared per column. That affects index descriptors. In addition, via CASTs, collations can be declared per sortable expression in an ORDER BY clause. That affects the sorter. I'm not the person scratching this initial itch. I just want to register my instinct to design-in the generality up front. I think this has two advantages:

1) It will remove an upgrade issue later on when someone wants to implement more of the SQL collation support.

2) It generally lowers the barrier to implementing more of the standard.

Regards,
-Rick


I am just not sure how comfortable I feel forcing an upgrade issue on a
developer for a particular feature that is not their itch. Mamta is trying to solve single collation database problem, not full SQL collation support.


There's a number of factors that come in, one is the long term maintainability of the code. I think that trumps any single developer's itch. The developer can work with the community in coming up with a solution that keeps a good balance between what the community see as maintainability and scratching their itch.

I'm actually trying to save the contributor (Mamta) work here, I think changing all the locations that generate characters to have the correct "new-character-type" is a huge amount of work and subject to errors (just from the amount of changes and interesting situations). E.g. in some situations a literal will be a CHAR (sorting by ucs_basic) and others a CHAR (sorting by locale). That decision may not be able to be made until very late in the bind time, and may not possibly even matter even thought code would have to pick one. Only caring about this when collation is involved may make it easier.

I obviously don't know "all the places", so it is not clear to me why some of the places don't have to change. It is not clear to me why one does not in the new proposal have to change all the locations that generate characters to have the correct "new-collation-type". I think
this is because I dont understand the runtime usages.  Am I at least
right about the following locations where we persist the columns.  If
we get the right info into them when we persist them, then we can get
the right info into them when we read them back.

Let me see if I can explain the general compile time situations I'm thinking about.

Assume a SQL expression where all the types are CHAR.

   f('fred', col1, col2) = col3

Now currently the bind code is going to resolve types in this order:

B1) 'fred' - CHAR
B2) col1 - CHAR
B3) col2 - CHAR
B4) f('fred', col1, col2) - CHAR
B5) col3 - CHAR
B6) result - BOOLEAN

once pass, got the right result! :-)

So with the proposed dual type system we have the two internal character types for CHAR:

CHAR(locale) - CHAR with collation for user columns
CHAR(ucs_basic) - CHAR collation for UCS_BASIC for system columns

Now in the proposed dual type system, the same bind ordering will occur.

Let's assume all col1 & col2 are user columns.

So the bind will result in

B1) 'fred' - unknown
B2) col1 - CHAR(locale)
B3) col2 - CHAR(locale)
B4) f('fred', col1, col2) - unknown
B5) col3 - CHAR(locale)
B6) result - unknown - don't know if types can be compared.

So come the end of bind time we haven't resolved the types,
so a second bind phase would be needed, which doesn't exist at the moment.

So what would that second phase do?

Bii1) 'fred' still unknown, no good reason to pick either type. Could base it on the other arguments but what if no other character arguments or other character arguments are a mix of CHAR(locale) and CHAR(ucs_basic)? Bii2) f('fred', col1, col2) - unknown, don't know how to look up the function without a type
Bii3) result - unknown  - don't know if types can be compared.

Whoops, no progress, fail query.

Now one could introduce a partial type, or a third CHAR type - CHAR(unknown), which might solve the problem. First phase would be:

B1) 'fred' - CHAR(unknown)
B2) col1 - CHAR(locale)
B3) col2 - CHAR(locale)
B4) f('fred', col1, col2) - CHAR(unknown)
B5) col3 - CHAR(locale)
B6) result - BOOLEAN

Hmmmm, so I've got the right result but I've been left with a series of CHAR(unknown) in the tree, four options: 1) add extra bind phases to resolve them, but I think this will fail for the same reasons where we didn't have CHAR(unknown). 2) make CHAR(unknown) a first class internal type, fully supported at runtime and compile time. 3) resolve them in a second bind phase to CHAR(ucs_basic), doesn't work because can't compare across collations 4) resolve them in a smarter second bind phase where the unknown types are converted to the matching type in a collation operator and CHAR(ucs_basic) elsewhere.

So a possible solution, but note a third character type has been added for CHAR and either needs to fully implement an internal type to make sure it works in the compile and execution system or I need a second bind phase.

To me this seems like it's heading off in the direction of a hack, a third character internal type for CHAR? Hacks lead to bugs, bugs lead to the dark side :-)

-------------------------------------

So what about having collation as an attribute of a character type, then we have:

B1) 'fred' - CHAR collation=unknown
B2) col1 - CHAR collation=locale
B3) col2 - CHAR collation=locale
B4) f('fred', col1, col2) - CHAR collation=unknown
B5) col3 - CHAR collation=locale
B6) result - CHAR collation=locale

First pass got the right result, just like the three type CHAR(unknown) case. :-) What about those unknown collations, no problem, I know that if a collation was unknown then that information is not needed, and since they are just the standard CHAR type I already know that works at compile and execute time.

[in both the compare unknown collation to known collation the compiler would generate code to execute the comparison using the known collation. This may happen automatically using the precedence system or compiler inserts some promote code, which possibly goes back to the some of the methods I proposed earlier. ]

-----------------------------------------------
Ok, double check, let's make the expression

   f('fred', col1, syscol2) = syscol3

Of course the current code is going to bind everything to CHAR.

let's try the multi-type system with CHAR(unknown)

B1) 'fred' - CHAR(unknown)
B2) col1 - CHAR(locale)
B3) syscol2- CHAR(ucs_basic)
B4) f('fred', col1, syscol2) - CHAR(unknown)
B5) syscol3 - CHAR(ucs_basic)
B6) result - BOOLEAN

and similar for the attribute case

B1) 'fred' - CHAR collation=unknown
B2) col1 - CHAR collation=locale
B3) syscol2- CHAR collation=ucs_basic
B4) f('fred', col1, syscol2) - CHAR collation=unknown
B5) syscol3 - CHAR collation=ucs_basic
B6) result - BOOLEAN

---------------------------------------------------
and in case you are wondering why a literal such as 'fred' or a function return doesn't just resolve to CHAR(locale), consider these examples:

'fred' = syscol1 - would fail, can't compare across collations, but database meta data queries depend on this behaviour

  'fred' = col1

---------------------------------------------------

and just for kicks

'fred' = 'barney'

Both are unknown collation types, in this case I think the result of collation would the default user type, collation=locale.

Sorry for the long e-mail.

Hope this is clear.
Dan.





Reply via email to