I'd just document it as a known limitation and move on for now, until there
are enough end users that need this. Spark is also very powerful with UDFs
and end users can easily work around this using UDFs.
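For illustration, a minimal sketch (plain JDK, not a Spark API) of the locale-aware case mapping such a UDF could wrap; the class name and locale handling here are my own, not from Spark:

```java
import java.util.Locale;

public class TurkishCase {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr");
        String dottedI = "\u0130"; // 'İ', LATIN CAPITAL LETTER I WITH DOT ABOVE

        // Root locale: lower("İ") yields "i" followed by U+0307 (combining dot above),
        // which is the "i with dot" result Spark/Hive return today.
        System.out.println(dottedI.toLowerCase(Locale.ROOT));

        // Turkish locale: lower("İ") yields plain "i"
        System.out.println(dottedI.toLowerCase(turkish));

        // Turkish locale: lower("I") yields dotless "ı" (U+0131)
        System.out.println("I".toLowerCase(turkish));

        // Turkish locale: upper("i") yields "İ"
        System.out.println("i".toUpperCase(turkish));
    }
}
```

Registering a one-line wrapper around `String.toLowerCase(Locale)` as a UDF gives users locale control today without changing the built-in functions.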

--
excuse the brevity and lower case due to wrist injury


On Tue, Sep 18, 2018 at 11:14 PM seancxmao <seancx...@gmail.com> wrote:

> Hi, all
>
> We found that there are some differences in how Spark and other database
> systems handle the case of special characters. See the list below for an
> example (you may also check the attached pictures):
>
> select upper("i"), lower("İ"), upper("ı"), lower("I");
> ------------------------------------------------------
> Spark      I, i with dot, I, i
> Hive       I, i with dot, I, i
> Teradata   I, i,          I, i
> Oracle     I, i,          I, i
> SQLServer  I, i,          I, i
> MySQL      I, i,          I, i
>
> "İ" and "ı" are Turkish characters. If locale-sensitive case handling were
> used, the expected results of the above upper/lower functions would be:
> select upper("i"), lower("İ"), upper("ı"), lower("I");
> ------------------------------------------------------
> İ, i, I, ı
>
> But it seems that these systems all do locale-insensitive mapping. Presto
> explicitly describes this as a known issue in its docs (
> https://prestodb.io/docs/current/functions/string.html)
> > The lower() and upper() functions do not perform locale-sensitive,
> context-sensitive, or one-to-many mappings required for some languages.
> Specifically, this will return incorrect results for Lithuanian, Turkish
> and Azeri.
>
> Java-based systems (Spark/Hive) behave the same way since they all depend
> on the same JDK String methods, and Teradata/Oracle/SQLServer/MySQL likewise
> agree with one another. However, the two groups differ for lower("İ"):
> Java-based systems (Spark/Hive) return "i with dot" while the other
> database systems (Teradata/Oracle/SQLServer/MySQL) return "i".
>
> My questions:
> (1) Should we let Spark return "i" for lower("İ"), which is the same as
> other database systems?
> (2) Should Spark support locale-sensitive upper/lower functions? Because
> different rows of a table may need different locales, we cannot even set
> the locale at table level. What we might do is provide upper(string,
> locale)/lower(string, locale), and let users decide what locale they want
> to use.
>
> Some references below. Just FYI.
>
> *
> https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#toLowerCase-java.util.Locale-
> *
> https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#toUpperCase-java.util.Locale-
> * http://grepalex.com/2013/02/14/java-7-and-the-dotted--and-dotless-i/
> *
> https://stackoverflow.com/questions/3322152/is-there-a-way-to-get-rid-of-accents-and-convert-a-whole-string-to-regular-lette
>
> Your comments and advice are highly appreciated.
>
> Many thanks!
> Chenxiao Mao (Sean)
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
