Looks great - thank you for writing this. I have some questions. If they are 
already answered in your document, forgive me, and just say “That’s answered in 
the document.”

I very much like the idea of adding default charset and collation to 
RelDataTypeSystem. This will help to carry them to all points in the code where 
they are needed.

I also like the idea of adding charset and collation as table options. It seems 
that this feature is non-essential, and could be done in phase 2, if necessary. 
Also, it mainly applies to SQL DDL, i.e. the “server” module. I don’t think we 
need to add default charset and collation to the Table or RelOptTable 
interfaces, just SqlCreateTable.

Regarding the column options: could charset and collation not be specified as 
part of the column’s data type?

When we are parsing a SQL character literal, the characters of that literal are 
in the same encoding as the SQL string itself. The parser (see the line 
‘UNICODE_INPUT = true;’ in the generated Parser.jj file) seems to assume that 
input is Unicode. That seems fine to me. Do you agree?

Unqualified character literals (e.g. ‘hello’, vs qualified _UTF8’hello’) are 
always UTF16. Is that correct? Should we provide a way to change that default? 
Do any major databases provide a way to change that default?

In a scenario where different columns have different charsets/collations, I 
assume that there will be a lot of implicit conversion going on. (Not to 
mention explicit conversion, using CONVERT.) Are there concerns about this? Are 
the rules well-defined if, say, we compare a UTF8 with a UTF16 string, or 
concatenate a UTF8 with a UTF16 string?
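
To illustrate why the rules matter, here is a minimal Java sketch (not Calcite 
code; the class and method names are hypothetical). It shows that the same two 
strings sort differently depending on whether they are compared as UTF-8 bytes 
or as UTF-16 code units, so a mixed comparison needs an agreed rule.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Calcite.
final class EncodingOrderDemo {
  /** Compares two byte arrays lexicographically, treating bytes as unsigned. */
  static int compareUnsignedBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.length, b.length);
  }

  /** Binary comparison of the UTF-8 encodings (this matches code point order). */
  static int compareUtf8(String a, String b) {
    return compareUnsignedBytes(
        a.getBytes(StandardCharsets.UTF_8),
        b.getBytes(StandardCharsets.UTF_8));
  }

  /** Binary comparison of UTF-16 code units, which is what String.compareTo does. */
  static int compareUtf16(String a, String b) {
    return a.compareTo(b);
  }

  public static void main(String[] args) {
    String bmp = "\uFF5E";          // U+FF5E, a BMP character above the surrogate range
    String supp = "\uD83D\uDE00";   // U+1F600, stored as a surrogate pair in UTF-16

    // As UTF-8 bytes, U+FF5E sorts before U+1F600.
    System.out.println(compareUtf8(bmp, supp) < 0);   // prints true
    // As UTF-16 code units, the surrogate pair sorts first instead.
    System.out.println(compareUtf16(bmp, supp) > 0);  // prints true
  }
}
```

The two binary orders disagree only for supplementary characters versus BMP 
characters above the surrogate range, but that is enough to make ORDER BY and 
comparison results depend on which encoding the operands are converted to.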

I saw that MySQL has problems with 3-byte utf8 (aka utf8mb3) and 4-byte 
utf8mb4[1]. Are we going to avoid those problems?
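
For reference, a small Java sketch of the distinction (hypothetical class and 
method names, nothing from Calcite or the design doc): characters outside the 
Basic Multilingual Plane need 4 bytes in UTF-8, so they do not fit in a 
charset limited to 3 bytes per character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Calcite.
final class Utf8WidthDemo {
  /** Returns the number of bytes needed to encode the string in UTF-8. */
  static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  /**
   * Returns whether every code point is in the BMP, i.e. whether each one
   * encodes to at most 3 bytes in UTF-8.
   */
  static boolean fitsInThreeByteUtf8(String s) {
    return s.codePoints().noneMatch(Character::isSupplementaryCodePoint);
  }

  public static void main(String[] args) {
    String euro = "\u20AC";          // U+20AC, inside the BMP
    String emoji = "\uD83D\uDE00";   // U+1F600, outside the BMP

    System.out.println(utf8Length(euro));            // prints 3
    System.out.println(utf8Length(emoji));           // prints 4
    System.out.println(fitsInThreeByteUtf8(euro));   // prints true
    System.out.println(fitsInThreeByteUtf8(emoji));  // prints false
    // In Java's UTF-16 strings the emoji is two chars but one code point.
    System.out.println(emoji.length());                             // prints 2
    System.out.println(emoji.codePointCount(0, emoji.length()));    // prints 1
  }
}
```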

Julian

[1] https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html



> On Nov 15, 2018, at 4:13 AM, Ted Xu <[email protected]> wrote:
> 
> Hi folks,
> 
> I created a design doc
> https://docs.google.com/document/d/1wo5byn_6K_YOKiPdXNav1zgzt9IBC3SbPvpPnIShtXk/edit?usp=sharing
> for supporting charset in calcite, per previous discussions on this topic.
> 
> One thing I'm not sure about is the runtime (Codegen on Enumerable and RelExecutor
> etc.) change. Since I/O is decoupled by pluggable points like
> Schemas#enumerable, that part looks good to me already.
> 
> I'm sure there are a lot of misunderstandings and missing pieces in that doc
> above, please feel free to leave comments.
