Quanlong Huang has uploaded this change for review. (
http://gerrit.cloudera.org:8080/17785 )


Change subject: WIP IMPALA-2019(part-3): Add UTF-8 support for case conversion 
functions
......................................................................

WIP IMPALA-2019(part-3): Add UTF-8 support for case conversion functions

There are three built-in string functions that perform case conversion:
upper, lower, and initcap. Previously they converted only English
alphabetic characters. This patch adds support for Unicode characters.
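
As a rough illustration of the difference (a sketch in Python, not the
C++ code in this patch; ascii_upper and unicode_upper are hypothetical
helper names), a byte-wise conversion that only maps a-z leaves every
multi-byte UTF-8 character untouched, while a conversion that decodes to
code points first can apply Unicode's default case mapping:

```python
def ascii_upper(utf8_bytes: bytes) -> bytes:
    """Upper-case only the ASCII letters a-z, byte by byte.

    Bytes >= 0x80 (parts of multi-byte UTF-8 sequences) pass through
    unchanged, so non-English letters are never converted.
    """
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in utf8_bytes)


def unicode_upper(utf8_bytes: bytes) -> bytes:
    """Decode to code points, apply the default Unicode upper-case
    mapping (locale-independent), and re-encode as UTF-8."""
    return utf8_bytes.decode("utf-8").upper().encode("utf-8")


word = "grüßen".encode("utf-8")

# ASCII-only conversion: "ü" and "ß" are left as they are.
assert ascii_upper(word).decode("utf-8") == "GRüßEN"

# Unicode-aware conversion: "ü" becomes "Ü" and "ß" becomes "SS".
assert unicode_upper(word).decode("utf-8") == "GRÜSSEN"
```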

Case conversion has many corner cases that depend on the locale and
context. For example:
1) Case conversion is locale-sensitive.
Turkish has four forms of the letter "I", while English has only two: a
lowercase dotted i and an uppercase dotless I. Turkish has lowercase and
uppercase forms of both the dotted and the dotless I, so simply
converting "i" to "I" for upper case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.
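
The three corner cases above can be reproduced with Python's built-in
str.upper() and str.lower(), which apply Unicode's default
(locale-independent) case mappings. This is an illustration only, not
the Boost.Locale code path used by this patch; turkish_upper and
turkish_lower are hypothetical helpers that hand-code the Turkish rule
for the two "I" pairs and are not a complete locale implementation.

```python
# 1) Locale-sensitive: the default mapping follows the English rule, so
#    it never produces the Turkish dotted capital I (U+0130).
assert "i".upper() == "I"


def turkish_upper(s: str) -> str:
    """Upper-case with the Turkish rule: i -> U+0130, U+0131 -> I."""
    return s.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkish_lower(s: str) -> str:
    """Lower-case with the Turkish rule: U+0130 -> i, I -> U+0131."""
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()


assert turkish_upper("i") == "\u0130"   # dotted capital I
assert turkish_lower("I") == "\u0131"   # dotless small i

# 2) Length-changing: one character ("ß") becomes two ("SS").
german = "grüßen"
german_upper = german.upper()
assert "SS" in german_upper
assert len(german) == 6 and len(german_upper) == 7

# 3) Context-sensitive: capital sigma becomes the final form (U+03C2)
#    at the end of a word and the medial form (U+03C3) elsewhere.
greek_lower = "ὈΔΥΣΣΕΎΣ".lower()
assert greek_lower.endswith("\u03c2")
assert greek_lower.count("\u03c3") == 2
```

Any conversion routine that assumes a one-to-one, context-free mapping
between characters cannot produce the results in cases 2) and 3), and a
locale-independent one cannot produce the Turkish result in case 1).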

This patch currently uses Boost.Locale for case conversion. ICU
(International Components for Unicode) is not integrated yet, since the
Boost in our native-toolchain is not built with ICU. As a result, the
localization backend of Boost.Locale is currently iconv, and the corner
cases above are not handled. We will consider integrating ICU in a
follow-up JIRA.

TODO: Add a query option to specify the locale

Test:
 - Add BE unit tests
 - TODO: add more tests

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
---
M CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
3 files changed, 105 insertions(+), 15 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/85/17785/1
--
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <[email protected]>
