I don't have the details in front of me, but I recall we explicitly overhauled the locale-sensitive toUpperCase and toLowerCase calls in the code for exactly this situation, so the current behavior should be intentional. I believe user data strings are handled in a locale-sensitive way, while things like reserved words in SQL of course are not. The Spark behavior seems the most correct, and it's consistent with Hive, right?
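For reference, the differences below can be reproduced with the plain JDK String methods that the SQL upper/lower functions ultimately rely on. A minimal sketch in plain Scala (not Spark code, just java.util.Locale; the root-locale results match the Spark/Hive row in the table quoted below):

import java.util.Locale

object TurkishCaseCheck {
  def main(args: Array[String]): Unit = {
    val turkish = new Locale("tr", "TR")

    // Locale-insensitive (root-locale) mappings:
    println("İ".toLowerCase(Locale.ROOT))  // "i" + U+0307 (combining dot above), i.e. "i with dot"
    println("ı".toUpperCase(Locale.ROOT))  // "I"
    println("i".toUpperCase(Locale.ROOT))  // "I"
    println("I".toLowerCase(Locale.ROOT))  // "i"

    // Locale-sensitive mappings under tr_TR, where dotted and dotless I are distinct letters:
    println("İ".toLowerCase(turkish))      // "i"
    println("ı".toUpperCase(turkish))      // "I"
    println("i".toUpperCase(turkish))      // "İ"
    println("I".toLowerCase(turkish))      // "ı"
  }
}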
On Wed, Sep 19, 2018, 1:14 AM seancxmao <seancx...@gmail.com> wrote:

> Hi, all
>
> We found that there are some differences in the case handling of special
> characters between Spark and other database systems. See the list below
> for an example (you may also check the attached pictures):
>
> select upper("i"), lower("İ"), upper("ı"), lower("I");
> ------------------------------------------------------
> Spark      I, i with dot, I, i
> Hive       I, i with dot, I, i
> Teradata   I, i,          I, i
> Oracle     I, i,          I, i
> SQLServer  I, i,          I, i
> MySQL      I, i,          I, i
>
> "İ" and "ı" are Turkish characters. If locale-sensitive case handling
> were used, the expected results of the upper/lower functions above would be:
>
> select upper("i"), lower("İ"), upper("ı"), lower("I");
> ------------------------------------------------------
> İ, i, I, ı
>
> However, it seems that all of these systems do locale-insensitive mapping.
> Presto explicitly describes this as a known issue in its docs
> (https://prestodb.io/docs/current/functions/string.html):
>
>   The lower() and upper() functions do not perform locale-sensitive,
>   context-sensitive, or one-to-many mappings required for some languages.
>   Specifically, this will return incorrect results for Lithuanian, Turkish
>   and Azeri.
>
> Java-based systems behave the same as one another since they all depend on
> the same JDK String methods, and Teradata/Oracle/SQLServer/MySQL also agree
> with one another. However, the two groups differ for lower("İ"): the
> Java-based systems (Spark/Hive) return "i with dot", while the other
> database systems (Teradata/Oracle/SQLServer/MySQL) return "i".
>
> My questions:
> (1) Should we let Spark return "i" for lower("İ"), the same as the other
> database systems?
> (2) Should Spark support locale-sensitive upper/lower functions? Because
> different rows of a table may need different locales, we cannot even set
> the locale at the table level. What we might do is provide
> upper(string, locale)/lower(string, locale) and let users decide which
> locale they want to use.
>
> Some references below. Just FYI.
>
> * https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#toLowerCase-java.util.Locale-
> * https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#toUpperCase-java.util.Locale-
> * http://grepalex.com/2013/02/14/java-7-and-the-dotted--and-dotless-i/
> * https://stackoverflow.com/questions/3322152/is-there-a-way-to-get-rid-of-accents-and-convert-a-whole-string-to-regular-lette
>
> Your comments and advice are highly appreciated.
>
> Many thanks!
> Chenxiao Mao (Sean)
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
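As a rough sketch of what the upper(string, locale)/lower(string, locale) idea in question (2) could look like, here is an approximation as a user-registered Scala UDF. The name lower_locale, its two-argument signature, and the use of BCP 47 tags are illustrative assumptions only, not an existing Spark API:

import java.util.Locale

import org.apache.spark.sql.SparkSession

object LocaleAwareLowerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("locale-aware-lower-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical locale-aware lower(); the second argument is a BCP 47 language tag.
    // Registered as a plain UDF here only to illustrate the proposed two-argument form.
    spark.udf.register("lower_locale", (s: String, tag: String) =>
      if (s == null) null else s.toLowerCase(Locale.forLanguageTag(tag)))

    spark.sql("SELECT lower_locale('İ', 'tr-TR') AS tr, lower_locale('İ', 'en-US') AS en").show()
    // Expected: tr = "i", en = "i" followed by U+0307 (the "i with dot" case above)

    spark.stop()
  }
}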