[ 
https://issues.apache.org/jira/browse/DRILL-5477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-5477:
------------------------------------
    Description: 
Drill string functions lower / upper / initcap work only for ASCII characters, 
not for multi-byte UTF-8 characters. UTF-8 is a multi-byte encoding, so the 
bytes must be decoded to Unicode characters before case mapping can be applied. 
Without that decoding, these functions won't work for Cyrillic, Greek or any 
other script with upper/lower case distinctions.

Currently, when a user applies these functions to non-ASCII UTF-8 text, Drill 
returns the input value unchanged.
Example:
{noformat}
select upper('привет') from (values(1)) -> привет
{noformat}
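
The difference between the two approaches can be sketched in plain Java. This 
is only an illustration: the class and method names are hypothetical, and the 
byte-wise routine is an assumption about the kind of ASCII-only logic that 
would produce the behavior above, not Drill's actual code. The string literal 
uses Unicode escapes for 'привет' so the sketch does not depend on the source 
file encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch, not Drill code.
final class Utf8CaseSketch {

    // Byte-wise upper-casing: only ASCII 'a'..'z' is mapped. Every byte of a
    // multi-byte UTF-8 sequence has its high bit set (negative as a Java byte),
    // so such bytes pass through unchanged.
    static byte[] asciiUpper(byte[] utf8) {
        byte[] out = utf8.clone();
        for (int i = 0; i < out.length; i++) {
            if (out[i] >= 'a' && out[i] <= 'z') {
                out[i] = (byte) (out[i] - 32);
            }
        }
        return out;
    }

    // Decode the UTF-8 bytes to a String, apply Unicode case mapping,
    // then encode the result back to UTF-8.
    static byte[] unicodeUpper(byte[] utf8) {
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        return decoded.toUpperCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u043f\u0440\u0438\u0432\u0435\u0442" is 'привет'
        byte[] input = "\u043f\u0440\u0438\u0432\u0435\u0442".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(asciiUpper(input), StandardCharsets.UTF_8));
        System.out.println(new String(unicodeUpper(input), StandardCharsets.UTF_8));
    }
}
```

The first call leaves the Cyrillic input as it was, while the second returns 
its upper-case form. Decoding per call has a cost, so a real fix might take a 
fast path for pure-ASCII input; that design choice is outside this sketch.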

There is a disabled unit test in 
https://github.com/arina-ielchiieva/drill/blob/master/exec/java-exec/src/test/java/org/apache/drill/exec/expr/fn/impl/TestStringFunctions.java#L33
 which should be enabled once the issue is fixed.

Please note that, by default, Calcite cannot encode string literals containing 
characters outside ISO-8859-1. Set the system property 
*saffron.default.charset* to *UTF-16LE* if you encounter the following error:
{noformat}
org.apache.drill.exec.rpc.RpcException: 
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
CalciteException: Failed to encode 'привет' in character set 'ISO-8859-1'
{noformat}
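
The error can be understood with a small stand-alone check, sketched below 
using only the JDK (no Drill or Calcite classes; the class and method names are 
hypothetical). It asks each character set whether it can encode the literal 
from the example, again written with Unicode escapes.

```java
import java.nio.charset.Charset;

// Hypothetical sketch: checks whether a charset can represent a given string.
final class CharsetCheckSketch {

    // Returns true if every character of text is encodable in the named charset.
    static boolean canEncode(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // "\u043f\u0440\u0438\u0432\u0435\u0442" is 'привет'
        String text = "\u043f\u0440\u0438\u0432\u0435\u0442";
        System.out.println("ISO-8859-1: " + canEncode("ISO-8859-1", text));
        System.out.println("UTF-16LE:   " + canEncode("UTF-16LE", text));
    }
}
```

ISO-8859-1 has no Cyrillic letters, so the first check fails, while UTF-16LE 
can represent the whole string. As a Java system property, the setting can be 
passed as a JVM option such as -Dsaffron.default.charset=UTF-16LE; where that 
option belongs depends on how Drill is launched.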



  was:
Drill string functions lower / upper / initcap work only for ASCII, but not for 
UTF-8. UTF-8 is a multi-byte code that requires special encoding/decoding to 
convert to Unicode characters. Without that encoding, these functions won't 
work for Cyrillic, Greek or any other character set with upper/lower 
distinctions.

Currently, when user applies these functions for UTF-8, Drill returns the same 
value as was given.
Example:
{noformat}
select upper('привет') from (values(1)) -> привет
{noformat}

There is disabled unit test in 
https://github.com/arina-ielchiieva/drill/blob/master/exec/java-exec/src/test/java/org/apache/drill/exec/expr/fn/impl/TestStringFunctions.java#L33
 which should be enabled once issue is fixed.


> String functions (lower, upper, initcap) should work for UTF-8
> --------------------------------------------------------------
>
>                 Key: DRILL-5477
>                 URL: https://issues.apache.org/jira/browse/DRILL-5477
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Functions - Drill
>    Affects Versions: 1.10.0
>            Reporter: Arina Ielchiieva
>
> Drill string functions lower / upper / initcap work only for ASCII 
> characters, not for multi-byte UTF-8 characters. UTF-8 is a multi-byte 
> encoding, so the bytes must be decoded to Unicode characters before case 
> mapping can be applied. Without that decoding, these functions won't work 
> for Cyrillic, Greek or any other script with upper/lower case distinctions.
> Currently, when a user applies these functions to non-ASCII UTF-8 text, 
> Drill returns the input value unchanged.
> Example:
> {noformat}
> select upper('привет') from (values(1)) -> привет
> {noformat}
> There is a disabled unit test in 
> https://github.com/arina-ielchiieva/drill/blob/master/exec/java-exec/src/test/java/org/apache/drill/exec/expr/fn/impl/TestStringFunctions.java#L33
>  which should be enabled once the issue is fixed.
> Please note that, by default, Calcite cannot encode string literals 
> containing characters outside ISO-8859-1. Set the system property 
> *saffron.default.charset* to *UTF-16LE* if you encounter the following error:
> {noformat}
> org.apache.drill.exec.rpc.RpcException: 
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
> CalciteException: Failed to encode 'привет' in character set 'ISO-8859-1'
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
