Thanks for the comments. Please find my replies inline.

On Fri, Nov 16, 2018 at 2:06 AM Julian Hyde <jh...@apache.org> wrote:
> Looks great - thank you for writing this. I have some questions. If they
> are already answered in your document, forgive me, and just say “That’s
> answered in the document.”
>
> I very much like the idea of adding default charset and collation to
> RelDataTypeSystem. This will help to carry them to all points in the code
> where they are needed.
>
> I also like the idea of adding charset and collation as table options. It
> seems that this feature is non-essential, and could be done in phase 2, if
> necessary. Also, it mainly applies to SQL DDL, i.e. the “server” module. I
> don’t think we need to add default charset and collation to the Table or
> RelOptTable interfaces, just SqlCreateTable.
>

Agreed.

> Regarding the column options. Could charset and collation not be
> specified as part of the column’s data type?
>

Yes. If not specified, the column charset is deduced from the table default, then the session default, then the system default charset.

> When we are parsing a SQL character literal, the characters of that
> literal are in the same encoding as the SQL string itself. The parser (see
> the line ‘UNICODE_INPUT = true;’ in the generated Parser.jj file) seems to
> assume that input is unicode. That seems fine to me - do you agree?
>

Agreed. By fixing the 'core charset' to be UTF-16, we get better performance and lower coding effort.

> Unqualified character literals (e.g. ‘hello’, vs qualified _UTF8’hello’)
> are always UTF16. Is that correct? Should we provide a way to change that
> default? Do any major databases provide a way to change that default?
>

IMO unqualified character literals should use the default charset. Instead of treating 'hello' as _UTF16'hello', it is more convenient to treat it as _${DEFAULT_CHARSET}'hello', where DEFAULT_CHARSET is defined by the session/system configuration (connection/startup configuration in MySQL https://goo.gl/67hOXK , or SqlSetOption in Calcite) or by the type system.
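To make the fallback order above concrete (column, then table, then session, then system default), here is a minimal sketch in plain Java. CharsetResolver and its resolve method are hypothetical names used only for illustration. They are not existing Calcite API, and the design doc may end up wiring this differently (for example through RelDataTypeSystem).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

/** Hypothetical helper (not part of Calcite): returns the first charset that is set. */
final class CharsetResolver {
  private CharsetResolver() {}

  /**
   * Resolves the effective charset of a column.
   * A null argument means "not specified at this level".
   * The system default is assumed to always be present.
   */
  static Charset resolve(Charset column, Charset table, Charset session,
      Charset systemDefault) {
    if (column != null) {
      return column;   // explicit column-level charset wins
    }
    if (table != null) {
      return table;    // otherwise the table default
    }
    if (session != null) {
      return session;  // otherwise the session default
    }
    return Objects.requireNonNull(systemDefault, "systemDefault");
  }

  public static void main(String[] args) {
    // Column and table say nothing, so the session default is used.
    System.out.println(
        resolve(null, null, StandardCharsets.UTF_8, StandardCharsets.UTF_16));
  }
}
```

The same order would apply to an unqualified literal such as 'hello', except that only the session and system levels are relevant there.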
> In a scenario where different columns have different charsets/collations, I
> assume that there will be a lot of implicit conversion going on. (Not to
> mention explicit conversion, using CONVERT.) Are there concerns about this?
> Are the rules well-defined if, say we compare a UTF8 with a UTF16 string,
> or concatenate a UTF8 with a UTF16 string?
>

There may be concerns. I've already found two points:
1. SQL function return type inference.
2. RelDataTypeFactory#leastRestrictive
These may in turn affect rules like ReduceExpressionsRule.

> I saw that MySQL has problems with 3-byte utf8 (aka utf8mb3) and 4-byte
> utf8mb4[1]. Are we going to avoid those problems?
>

I'm not sure, but the Java UTF-8 encoder/decoder looks good.

> Julian
>
> [1] https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
>
> > On Nov 15, 2018, at 4:13 AM, Ted Xu <frank...@gmail.com> wrote:
> >
> > Hi folks,
> >
> > I created a design doc
> > https://docs.google.com/document/d/1wo5byn_6K_YOKiPdXNav1zgzt9IBC3SbPvpPnIShtXk/edit?usp=sharing
> > for supporting charset in Calcite, per previous discussions on this
> > topic.
> >
> > One thing I'm not sure about is the runtime (Codegen on Enumerable and
> > RelExecutor etc) change. Since I/O is decoupled by pluggable points like
> > Schemas#enumerable, that part looks good to me already.
> >
> > I'm sure there are a lot of misunderstandings and missing pieces in that
> > doc above, please feel free to leave comments.
> >
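On the utf8mb3/utf8mb4 question: a quick way to check how Java's built-in UTF-8 codec behaves is to encode a character outside the Basic Multilingual Plane and decode it again. The sketch below uses only java.nio.charset.StandardCharsets and String methods from the JDK. Utf8RoundTrip is a made-up class name for illustration. It only shows that the JDK codec handles 4-byte sequences, and says nothing about what a storage backend does with them.

```java
import java.nio.charset.StandardCharsets;

/** Illustration only: checks UTF-8 encoding of BMP and supplementary characters. */
final class Utf8RoundTrip {
  private Utf8RoundTrip() {}

  /** Number of bytes the string occupies when encoded as UTF-8. */
  static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  /** Encodes to UTF-8 and decodes back to a (UTF-16) Java String. */
  static String roundTrip(String s) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String euro = "\u20AC";          // U+20AC, inside the BMP
    String emoji = "\uD83D\uDE00";   // U+1F600, a surrogate pair in UTF-16

    System.out.println(utf8Length(euro));    // 3 bytes in UTF-8
    System.out.println(utf8Length(emoji));   // 4 bytes in UTF-8
    System.out.println(emoji.length());      // 2 UTF-16 code units
    System.out.println(emoji.codePointCount(0, emoji.length())); // 1 code point
    System.out.println(roundTrip(emoji).equals(emoji));          // true
  }
}
```

With UTF-16 as the core charset, the point to watch is that String#length counts code units and not characters, so length-based functions may need codePointCount for supplementary characters.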