[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238173#comment-16238173 ]

Hudson commented on PHOENIX-4237:
-

SUCCESS: Integrated in Jenkins build Phoenix-master #1865 (See [https://builds.apache.org/job/Phoenix-master/1865/])
PHOENIX-4237 Allow sorting on (Java) collation keys for non-English (jtaylor: rev ee4355791acf3f31568fcd8c43367947d25a1386)
* (add) phoenix-core/src/it/java/org/apache/phoenix/end2end/CollationKeyFunctionIT.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/expression/ExpressionType.java
* (add) phoenix-core/src/test/java/org/apache/phoenix/expression/function/CollationKeyFunctionTest.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/jdbc/PhoenixConnection.java
* (add) phoenix-core/src/main/java/org/apache/phoenix/util/VarBinaryFormatter.java
* (edit) LICENSE
* (edit) phoenix-server/pom.xml
* (add) phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
* (edit) phoenix-core/pom.xml

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Shehzaad Nakhoda
> Assignee: Shehzaad Nakhoda
> Priority: Major
> Fix For: 4.13.0
> Attachments: PHOENIX-4237_v1.patch, PHOENIX-4237_v2.patch, PHOENIX-4237_v3.patch
>
> Strings stored via Phoenix can be composed from a subset of the entire set of
> Unicode characters. The natural sort order for strings for different
> languages often differs from the order dictated by the binary representation
> of the characters of these strings. Java provides the idea of a Collator
> which given an input string and a (language) locale can generate a Collation
> Key which can then be used to compare strings in that natural order.
> Salesforce has recently open-sourced grammaticus. IBM has open-sourced ICU4J
> some time ago. These technologies can be combined to provide a robust new
> Phoenix function that can be used in an ORDER BY clause to sort strings
> according to the user's locale.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
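
Editor's illustration: the issue description above refers to Java's Collator and Collation Key. The sketch below shows that idea using only the JDK's java.text.Collator. It is an assumption-laden example, not the code from the patch: the class name CollationKeyDemo and the helper sortByCollationKey are hypothetical, and the committed Phoenix function reportedly builds on ICU4J and grammaticus rather than on the JDK collator.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; uses the JDK collator, not Phoenix's CollationKeyFunction.
class CollationKeyDemo {

    // Returns a copy of the input sorted by the collation keys of the given locale.
    static List<String> sortByCollationKey(List<String> input, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<CollationKey> keys = new ArrayList<>();
        for (String s : input) {
            // A CollationKey is a precomputed form of the string that compares
            // in the locale's natural order.
            keys.add(collator.getCollationKey(s));
        }
        // CollationKey implements Comparable<CollationKey>.
        keys.sort(Comparator.naturalOrder());
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cherry", "Banana", "apple");

        // Plain String ordering compares UTF-16 code units.
        List<String> binary = new ArrayList<>(words);
        binary.sort(Comparator.naturalOrder());

        System.out.println("binary order:    " + binary);
        System.out.println("collation order: " + sortByCollationKey(words, Locale.ENGLISH));
    }
}
```

With these inputs, plain String ordering places "Banana" before "apple" because uppercase letters precede lowercase ones in Unicode, while the English collator orders the words alphabetically regardless of case. Sorting on a key derived from the string is the same shape as ORDER BY COLLATION_KEY(data, 'locale') in the SQL quoted in the QA reports further down the thread.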
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237834#comment-16237834 ]

James Taylor commented on PHOENIX-4237:
---

+1. Great work, [~shehzaadn]!

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Shehzaad Nakhoda
> Assignee: Shehzaad Nakhoda
> Priority: Major
> Fix For: 4.12.0
> Attachments: PHOENIX-4237_v1.patch, PHOENIX-4237_v2.patch, PHOENIX-4237_v3.patch
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237360#comment-16237360 ]

Hadoop QA commented on PHOENIX-4237:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12895584/PHOENIX-4237_v3.patch
  against master branch at commit 1e48eabe4cbf72ce71fb0dbdd6053a9600133ee4.
  ATTACHMENT ID: 12895584

{color:red}-1 @author{color}. The patch appears to contain 1 @author tags which the Hadoop community has agreed to not allow in code contributions.

{color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:red}-1 lineLengths{color}. The patch introduces the following lines longer than 100:
    + queryWithCollKeyDefaultArgsWithExpectedOrder("zh_TW", 0, 6, new Integer[] { 0, 3, 4, 1, 5, 2, 6 });
    + queryWithCollKeyDefaultArgsWithExpectedOrder("zh_TW_STROKE", 0, 6, new Integer[] { 4, 2, 0, 3, 1, 6, 5 });
    + queryWithCollKeyDefaultArgsWithExpectedOrder("zh__STROKE", 0, 6, new Integer[] { 0, 1, 3, 4, 6, 2, 5 });
    + queryWithCollKeyDefaultArgsWithExpectedOrder("zh__PINYIN", 0, 6, new Integer[] { 0, 1, 3, 4, 6, 2, 5 });
    + queryWithCollKeyUpperCaseWithExpectedOrder("en", 7, 13, new Integer[] { 7, 10, 11, 13, 9, 12, 8 });
    + private void queryWithCollKeyDefaultArgsWithExpectedOrder(String localeString, Integer beginIndex, Integer endIndex,
    + "SELECT id, data FROM %s WHERE ID BETWEEN %d AND %d ORDER BY COLLATION_KEY(data, '%s')", tableName,
    + private void queryWithCollKeyUpperCaseWithExpectedOrder(String localeString, Integer beginIndex, Integer endIndex,
    + "SELECT id, data FROM %s WHERE ID BETWEEN %d AND %d ORDER BY COLLATION_KEY(data, '%s', true), id",
    + private void queryWithCollKeyWithStrengthWithExpectedOrder(String localeString, Integer strength, boolean isDescending,

{color:red}-1 core tests{color}. The patch failed these unit tests:
    ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexFailureIT

Test results: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1614//testReport/
Console output: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1614//console

This message is automatically generated.

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Shehzaad Nakhoda
> Assignee: Shehzaad Nakhoda
> Priority: Major
> Fix For: 4.12.0
> Attachments: PHOENIX-4237_v1.patch, PHOENIX-4237_v2.patch, PHOENIX-4237_v3.patch
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213562#comment-16213562 ]

Hadoop QA commented on PHOENIX-4237:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12893345/PHOENIX-4237_v2.patch
  against master branch at commit 7cdcb2313b08d2eaeb775f0c989642f8d416cfb6.
  ATTACHMENT ID: 12893345

{color:red}-1 @author{color}. The patch appears to contain 18 @author tags which the Hadoop community has agreed to not allow in code contributions.

{color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:red}-1 release audit{color}. The applied patch generated 60 release audit warnings (more than the master's current 0 warnings).

{color:red}-1 lineLengths{color}. The patch introduces the following lines longer than 100:
    + queryWithCollKeyDefaultArgsWithExpectedOrder("zh_TW", 0, 6, new Integer[] { 0, 3, 4, 1, 5, 2, 6 });
    + queryWithCollKeyDefaultArgsWithExpectedOrder("zh_TW_STROKE", 0, 6, new Integer[] { 4, 2, 0, 3, 1, 6, 5 });
    + queryWithCollKeyDefaultArgsWithExpectedOrder("zh__STROKE", 0, 6, new Integer[] { 0, 1, 3, 4, 6, 2, 5 });
    + queryWithCollKeyDefaultArgsWithExpectedOrder("zh__PINYIN", 0, 6, new Integer[] { 0, 1, 3, 4, 6, 2, 5 });
    + queryWithCollKeyUpperCaseWithExpectedOrder("en", 7, 13, new Integer[] { 7, 10, 11, 13, 9, 12, 8 });
    + private void queryWithCollKeyDefaultArgsWithExpectedOrder(String localeString, Integer beginIndex, Integer endIndex,
    + "SELECT id, data FROM %s WHERE ID BETWEEN %d AND %d ORDER BY COLLATION_KEY(data, '%s')", tableName,
    + private void queryWithCollKeyUpperCaseWithExpectedOrder(String localeString, Integer beginIndex, Integer endIndex,
    + "SELECT id, data FROM %s WHERE ID BETWEEN %d AND %d ORDER BY COLLATION_KEY(data, '%s', true), id",
    + private void queryWithCollKeyWithStrengthWithExpectedOrder(String localeString, Integer strength, boolean isDescending,

{color:red}-1 core tests{color}. The patch failed these unit tests:
    ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ReadIsolationLevelIT
    ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SetPropertyOnEncodedTableIT
    ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ConcurrentMutationsIT

Test results: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1565//testReport/
Release audit warnings: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1565//artifact/patchprocess/patchReleaseAuditWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1565//console

This message is automatically generated.

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Shehzaad Nakhoda
> Assignee: Shehzaad Nakhoda
> Fix For: 4.12.0
> Attachments: PHOENIX-4237_v1.patch, PHOENIX-4237_v2.patch
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213121#comment-16213121 ]

Hadoop QA commented on PHOENIX-4237:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12893307/PHOENIX-4237_v1.patch
  against master branch at commit 7cdcb2313b08d2eaeb775f0c989642f8d416cfb6.
  ATTACHMENT ID: 12893307

{color:red}-1 @author{color}. The patch appears to contain 17 @author tags which the Hadoop community has agreed to not allow in code contributions.

{color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:red}-1 lineLengths{color}. The patch introduces the following lines longer than 100:
    + queryWithCollKeyDefaultArgsWithExpectedOrder("zh_TW", 0, 6, new Integer[] { 0, 3, 4, 1, 5, 2, 6 });
    + queryWithCollKeyDefaultArgsWithExpectedOrder("zh_TW_STROKE", 0, 6, new Integer[] { 4, 2, 0, 3, 1, 6, 5 });
    + queryWithCollKeyDefaultArgsWithExpectedOrder("zh__STROKE", 0, 6, new Integer[] { 0, 1, 3, 4, 6, 2, 5 });
    + queryWithCollKeyDefaultArgsWithExpectedOrder("zh__PINYIN", 0, 6, new Integer[] { 0, 1, 3, 4, 6, 2, 5 });
    + queryWithCollKeyUpperCaseWithExpectedOrder("en", 7, 13, new Integer[] { 7, 10, 11, 13, 9, 12, 8 });
    + private void queryWithCollKeyDefaultArgsWithExpectedOrder(String localeString, Integer beginIndex, Integer endIndex,
    + "SELECT id, data FROM %s WHERE ID BETWEEN %d AND %d ORDER BY COLLATION_KEY(data, '%s')", tableName,
    + private void queryWithCollKeyUpperCaseWithExpectedOrder(String localeString, Integer beginIndex, Integer endIndex,
    + "SELECT id, data FROM %s WHERE ID BETWEEN %d AND %d ORDER BY COLLATION_KEY(data, '%s', true), id",
    + private void queryWithCollKeyWithStrengthWithExpectedOrder(String localeString, Integer strength, boolean isDescending,

{color:red}-1 core tests{color}. The patch failed these unit tests:
    org.apache.phoenix.expression.function.CollationKeyFunctionTest

Test results: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1563//testReport/
Console output: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1563//console

This message is automatically generated.

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Shehzaad Nakhoda
> Assignee: Shehzaad Nakhoda
> Fix For: 4.12.0
> Attachments: PHOENIX-4237_v1.patch
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212015#comment-16212015 ]

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user snakhoda-sfdc commented on the issue:

    https://github.com/apache/phoenix/pull/275

    @JamesRTaylor I'm not sure how to do that within this PR. Looking at https://github.com/blog/2141-squash-your-commits, I believe at the time you merge the PR, github should give you the option to squash all commits into one. Will that suffice?

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212002#comment-16212002 ]

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on the issue:

    https://github.com/apache/phoenix/pull/275

    Would you mind squashing all the commits into a single commit, @shehzaadn, and I'll get this committed?

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211765#comment-16211765 ]

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on the issue:

    https://github.com/apache/phoenix/pull/275

    +1. Nice work, @shehzaadn!

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211282#comment-16211282 ]

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user snakhoda-sfdc commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r145747600

    --- Diff: phoenix-core/src/main/java/com/ibm/icu/impl/jdkadapter/NumberFormatICU.java ---
    @@ -0,0 +1,229 @@
    +// © 2016 and later: Unicode, Inc. and others.
    +// License & terms of use: http://www.unicode.org/copyright.html#License
    +/*
    + ***
    + * Copyright (C) 2008, International Business Machines Corporation and *
    + * others. All Rights Reserved. *
    + ***
    + */
    +package com.ibm.icu.impl.jdkadapter;
    +
    +import java.math.RoundingMode;
    +import java.text.FieldPosition;
    +import java.text.ParseException;
    +import java.text.ParsePosition;
    +import java.util.Currency;
    +
    +import com.ibm.icu.impl.icuadapter.NumberFormatJDK;
    +import com.ibm.icu.text.NumberFormat;
    +
    +/**
    + * NumberFormatICU is an adapter class which wraps ICU4J NumberFormat and
    + * implements java.text.NumberFormat APIs.
    + */
    +public class NumberFormatICU extends java.text.NumberFormat {
    +
    +    private static final long serialVersionUID = 4892903815641574060L;
    +
    +    private NumberFormat fIcuNfmt;
    +
    +    private NumberFormatICU(NumberFormat icuNfmt) {
    +        fIcuNfmt = icuNfmt;
    +    }
    +
    +    public static java.text.NumberFormat wrap(NumberFormat icuNfmt) {
    +        if (icuNfmt instanceof NumberFormatJDK) {
    +            return ((NumberFormatJDK)icuNfmt).unwrap();
    +        }
    +        return new NumberFormatICU(icuNfmt);
    +    }
    +
    +    public NumberFormat unwrap() {
    +        return fIcuNfmt;
    +    }
    +
    +    @Override
    +    public Object clone() {
    +        NumberFormatICU other = (NumberFormatICU)super.clone();
    +        other.fIcuNfmt = (NumberFormat)fIcuNfmt.clone();
    +        return other;
    +    }
    +
    +    @Override
    +    public boolean equals(Object obj) {
    +        if (obj instanceof NumberFormatICU) {
    +            return ((NumberFormatICU)obj).fIcuNfmt.equals(fIcuNfmt);
    +        }
    +        return false;
    +    }
    +
    +    //public String format(double number)
    --- End diff --

    Thanks for taking a look at this PR, @solzy.

    This code is external and simply copied over from ICU4J 59.1. The reason it's here at all is that that project doesn't have all its artifacts in maven.

    I'm hoping to have a new PR in the near future to remove this external code and replace it with maven dependencies.

    CC: @JamesRTaylor

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210750#comment-16210750 ]

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user solzy commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r145639624

    --- Diff: phoenix-core/src/main/java/com/ibm/icu/impl/jdkadapter/NumberFormatICU.java ---
    @@ -0,0 +1,229 @@
    +// © 2016 and later: Unicode, Inc. and others.
    +// License & terms of use: http://www.unicode.org/copyright.html#License
    +/*
    + ***
    + * Copyright (C) 2008, International Business Machines Corporation and *
    + * others. All Rights Reserved. *
    + ***
    + */
    +package com.ibm.icu.impl.jdkadapter;
    +
    +import java.math.RoundingMode;
    +import java.text.FieldPosition;
    +import java.text.ParseException;
    +import java.text.ParsePosition;
    +import java.util.Currency;
    +
    +import com.ibm.icu.impl.icuadapter.NumberFormatJDK;
    +import com.ibm.icu.text.NumberFormat;
    +
    +/**
    + * NumberFormatICU is an adapter class which wraps ICU4J NumberFormat and
    + * implements java.text.NumberFormat APIs.
    + */
    +public class NumberFormatICU extends java.text.NumberFormat {
    +
    +    private static final long serialVersionUID = 4892903815641574060L;
    +
    +    private NumberFormat fIcuNfmt;
    +
    +    private NumberFormatICU(NumberFormat icuNfmt) {
    +        fIcuNfmt = icuNfmt;
    +    }
    +
    +    public static java.text.NumberFormat wrap(NumberFormat icuNfmt) {
    +        if (icuNfmt instanceof NumberFormatJDK) {
    +            return ((NumberFormatJDK)icuNfmt).unwrap();
    +        }
    +        return new NumberFormatICU(icuNfmt);
    +    }
    +
    +    public NumberFormat unwrap() {
    +        return fIcuNfmt;
    +    }
    +
    +    @Override
    +    public Object clone() {
    +        NumberFormatICU other = (NumberFormatICU)super.clone();
    +        other.fIcuNfmt = (NumberFormat)fIcuNfmt.clone();
    +        return other;
    +    }
    +
    +    @Override
    +    public boolean equals(Object obj) {
    +        if (obj instanceof NumberFormatICU) {
    +            return ((NumberFormatICU)obj).fIcuNfmt.equals(fIcuNfmt);
    +        }
    +        return false;
    +    }
    +
    +    //public String format(double number)
    --- End diff --

    delete this unusable line, keep clean!

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210656#comment-16210656 ]

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user snakhoda-sfdc commented on the issue:

    https://github.com/apache/phoenix/pull/275

    @JamesRTaylor I've addressed the last round of comments in this commit (9d6d4f7). Thanks.

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209676#comment-16209676 ]

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on the issue:

    https://github.com/apache/phoenix/pull/275

    Looking very good. Couple minor nits and the testing needs to be rounded out just a bit.

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209673#comment-16209673 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user JamesRTaylor commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r145475423

    --- Diff: phoenix-core/src/test/java/org/apache/phoenix/expression/function/CollationKeyFunctionTest.java ---
    @@ -0,0 +1,143 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.phoenix.expression.function;
    +
    +import static org.junit.Assert.assertEquals;
    +import static org.junit.Assert.assertArrayEquals;
    +import static org.junit.Assert.fail;
    +
    +import java.util.List;
    +
    +import org.apache.commons.codec.binary.Hex;
    +import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    +import org.apache.phoenix.expression.function.CollationKeyFunction;
    +import org.apache.phoenix.schema.SortOrder;
    +import org.apache.phoenix.schema.types.PBoolean;
    +import org.apache.phoenix.schema.types.PInteger;
    +import org.apache.phoenix.schema.types.PVarchar;
    +
    +import org.apache.phoenix.expression.Expression;
    +import org.apache.phoenix.expression.LiteralExpression;
    +
    +import org.junit.Test;
    +
    +import com.google.common.collect.Lists;
    +
    +/**
    + * "Unit" tests for CollationKeyFunction
    + *
    + * @author snakhoda-sfdc
    + *
    + */
    +public class CollationKeyFunctionTest {
    +
    +    @Test
    +    public void testChineseCollationKeyBytes() throws Exception {
    +
    +        // Chinese (China)
    +        test("\u963f", "zh", "02eb0001");
    +        test("\u55c4", "zh", "14ad0001");
    +        test("\u963e", "zh", "8000963f00010001");
    +        test("\u554a", "zh", "02ea0001");
    +        test("\u4ec8", "zh", "80004ec900010001");
    +        test("\u3d9a", "zh", "80003d9b00010001");
    +        test("\u9f51", "zh", "19050001");
    +
    +        // Chinese (Taiwan)
    +        test("\u963f", "zh_TW", "063d0001");
    +        test("\u55c4", "zh_TW", "241e0001");
    +        test("\u963e", "zh_TW", "8000963f00010001");
    +        test("\u554a", "zh_TW", "09c90001");
    +        test("\u4ec8", "zh_TW", "181b0001");
    +        test("\u3d9a", "zh_TW", "80003d9b00010001");
    +        test("\u9f51", "zh_TW", "80009f5200010001");
    +
    +        // Chinese (Taiwan, Stroke)
    +        test("\u963f", "zh_TW_STROKE", "5450010500");
    +        test("\u55c4", "zh_TW_STROKE", "7334010500");
    +        test("\u963e", "zh_TW_STROKE", "544f010500");
    +        test("\u554a", "zh_TW_STROKE", "62de010500");
    +        test("\u4ec8", "zh_TW_STROKE", "46be010500");
    +        test("\u3d9a", "zh_TW_STROKE", "a50392010500");
    +        test("\u9f51", "zh_TW_STROKE", "8915010500");
    +
    +        // Chinese (China, Stroke)
    +        test("\u963f", "zh__STROKE", "28010500");
    +        test("\u55c4", "zh__STROKE", "2a010500");
    +        test("\u963e", "zh__STROKE", "7575010500");
    +        test("\u554a", "zh__STROKE", "2b010500");
    +        test("\u4ec8", "zh__STROKE", "51a1010500");
    +        test("\u3d9a", "zh__STROKE", "a50392010500");
    +        test("\u9f51", "zh__STROKE", "6935010500");
    +
    +        // Chinese (China, Pinyin)
    +        test("\u963f", "zh__PINYIN", "28010500");
    +        test("\u55c4", "zh__PINYIN", "2a010500");
    +        test("\u963e", "zh__PINYIN", "7575010500");
    +        test("\u554a", "zh__PINYIN", "2b010500");
    +        test("\u4ec8", "zh__PINYIN", "51a1010500");
    +        test("\u3d9a", "zh__PINYIN", "a50392010500");
    +        test("\u9f51", "zh__PINYIN", "6935010500");
    +
    +    }
    +
    +    private
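The expectations in the test above are collation keys written as hex strings. As a rough illustration of that comparison style, the sketch below hex-encodes the key bytes produced by a plain java.text.Collator and decodes them again. The class and method names (CollationKeyHex, toHex, fromHex, keyBytes) are hypothetical, and using the JDK collator is an assumption made to keep the example dependency-free: it will generally not produce the same bytes as the grammaticus/i18n-util collators used by the patch, so no expected key values are shown.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of the Phoenix patch.
final class CollationKeyHex {

    // Encode bytes as lower-case hex, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Decode a hex string back into bytes.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex string");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Assumption: the JDK's own collator stands in for the one the patch obtains.
    static byte[] keyBytes(String input, Locale locale) {
        return Collator.getInstance(locale).getCollationKey(input).toByteArray();
    }

    public static void main(String[] args) {
        byte[] key = keyBytes("\u963f", Locale.CHINESE);
        String hex = toHex(key);
        System.out.println(hex);
        // The hex form round-trips to the same bytes.
        System.out.println(Arrays.equals(key, fromHex(hex)));
    }
}
```

Keeping expectations as hex strings makes a failing key easy to read in a test report, at the cost of pinning the test to one collator implementation and version.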
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209670#comment-16209670 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user JamesRTaylor commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r145474350

    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java ---
    @@ -0,0 +1,230 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.phoenix.expression.function;
    +
    +import java.io.DataInput;
    +import java.io.IOException;
    +import java.sql.SQLException;
    +import java.text.Collator;
    +import java.util.List;
    +import java.util.Locale;
    +
    +import org.apache.commons.lang.BooleanUtils;
    +import org.apache.commons.logging.Log;
    +import org.apache.commons.logging.LogFactory;
    +import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    +import org.apache.phoenix.expression.Expression;
    +import org.apache.phoenix.expression.LiteralExpression;
    +import org.apache.phoenix.parse.FunctionParseNode;
    +import org.apache.phoenix.schema.tuple.Tuple;
    +import org.apache.phoenix.schema.types.PBoolean;
    +import org.apache.phoenix.schema.types.PDataType;
    +import org.apache.phoenix.schema.types.PInteger;
    +import org.apache.phoenix.schema.types.PVarbinary;
    +import org.apache.phoenix.schema.types.PVarchar;
    +import org.apache.phoenix.util.VarBinaryFormatter;
    +
    +import com.force.db.i18n.LinguisticSort;
    +import com.force.i18n.LocaleUtils;
    +
    +/**
    + * A Phoenix Function that calculates a collation key for an input
    + * string based on a caller-provided locale and collator strength and
    + * decomposition settings.
    + *
    + * The locale should be specified as xx_yy_variant where xx is the ISO
    + * 639-1 2-letter language code, yy is the ISO 3166 2-letter
    + * country code. Both countryCode and variant are optional. For
    + * example, zh_TW_STROKE, zh_TW and zh are all valid locale
    + * representations. Note the language code, country code and variant
    + * are used as arguments to the constructor of java.util.Locale.
    + *
    + * This function uses the open-source grammaticus and i18n-util
    + * packages to obtain the collators it needs from the provided locale.
    + *
    + * The LinguisticSort implementation in i18n-util encapsulates
    + * sort-related functionality for a substantive list of locales. For
    + * each locale, it provides a collator and an Oracle-specific database
    + * function that can be used to sort strings according to the natural
    + * language rules of that locale.
    + *
    + * This function uses the collator returned by
    + * LinguisticSort.getCollator to produce a collation key for its input
    + * string. A user can expect that the sorting semantics of this
    + * function for a given locale are equivalent to the sorting behaviour
    + * of an Oracle query that is constructed using the Oracle functions
    + * returned by LinguisticSort for that locale.
    + *
    + * The optional third argument to the function is a boolean that
    + * specifies whether to use the upper-case collator (case-insensitive)
    + * returned by LinguisticSort.getUpperCaseCollator.
    + *
    + * The optional fourth and fifth arguments are used to set
    + * respectively the strength and decomposition of the collator returned
    + * by LinguisticSort using the setStrength and setDecomposition
    + * methods of java.text.Collator.
    + *
    + * @author snakhoda-sfdc
    + *
    + */
    +@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
    +        // input string
    +        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
    +        // ISO Code for Locale
    +        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
    +        // whether to use special upper case collator
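The Javadoc quoted above says the locale string has the form xx_yy_variant, that its parts are passed to the java.util.Locale constructor, and that the optional strength and decomposition arguments are applied with setStrength and setDecomposition. A minimal sketch of those two steps follows. The class and method names (LocaleAndCollatorSketch, parse, collatorFor) are hypothetical, and the sketch uses the JDK's Collator.getInstance as a stand-in: the patch itself goes through LocaleUtils and LinguisticSort, whose internals are not shown in this thread.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical sketch, not the code from the patch.
final class LocaleAndCollatorSketch {

    // Split "xx_yy_variant" into at most three parts; country and variant are optional,
    // so "zh", "zh_TW", "zh_TW_STROKE" and "zh__STROKE" are all accepted.
    static Locale parse(String code) {
        String[] parts = code.split("_", 3);
        String language = parts[0];
        String country = parts.length > 1 ? parts[1] : "";
        String variant = parts.length > 2 ? parts[2] : "";
        return new Locale(language, country, variant);
    }

    // A null strength or decomposition means "keep the collator's default",
    // mirroring the "null" default values of the optional arguments.
    static Collator collatorFor(String localeCode, Integer strength, Integer decomposition) {
        // Assumption: JDK collator used in place of LinguisticSort.getCollator.
        Collator collator = Collator.getInstance(parse(localeCode));
        if (strength != null) {
            collator.setStrength(strength);
        }
        if (decomposition != null) {
            collator.setDecomposition(decomposition);
        }
        return collator;
    }

    public static void main(String[] args) {
        Locale locale = parse("zh_TW_STROKE");
        System.out.println(locale.getLanguage() + " / " + locale.getCountry() + " / " + locale.getVariant());

        // PRIMARY strength ignores case differences; TERTIARY does not.
        System.out.println(collatorFor("en_US", Collator.PRIMARY, null).compare("a", "A"));
        System.out.println(collatorFor("en_US", Collator.TERTIARY, null).compare("a", "A"));
    }
}
```

Passing the limit 3 to split keeps an empty country slot, which is what makes a code such as zh__STROKE resolve to a language plus variant with no country.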
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209669#comment-16209669 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user JamesRTaylor commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r145473757

    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java ---
    @@ -0,0 +1,230 @@
    [...]
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209667#comment-16209667 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user JamesRTaylor commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r145473581

    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java ---
    @@ -0,0 +1,230 @@
    [...]
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208338#comment-16208338 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user snakhoda-sfdc commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r145257151

    --- Diff: phoenix-core/src/test/java/org/apache/phoenix/expression/function/CollationKeyFunctionTest.java ---
    @@ -96,33 +96,35 @@ private static boolean testExpression(String inputStr, String localeIsoCode, Sor
             strengthLiteral = LiteralExpression.newConstant(null, PInteger.INSTANCE, sortOrder);
             decompositionLiteral = LiteralExpression.newConstant(null, PInteger.INSTANCE, sortOrder);
             boolean ret = testExpression(inputStrLiteral, localeIsoCodeLiteral, upperCaseBooleanLiteral, strengthLiteral,
    -                decompositionLiteral, new PhoenixArray(PInteger.INSTANCE, expectedCollationKeyBytes));
    +                decompositionLiteral, expectedCollationKeyBytesHex);
             return ret;
         }

         private static boolean testExpression(LiteralExpression inputStrLiteral, LiteralExpression localeIsoCodeLiteral,
                 LiteralExpression upperCaseBooleanLiteral, LiteralExpression strengthLiteral,
    -            LiteralExpression decompositionLiteral, PhoenixArray expectedCollationKeyByteArray) throws SQLException {
    +            LiteralExpression decompositionLiteral, String expectedCollationKeyBytesHex) throws Exception {
             List<Expression> expressions = Lists.newArrayList((Expression) inputStrLiteral,
                     (Expression) localeIsoCodeLiteral, (Expression) upperCaseBooleanLiteral,
                     (Expression) strengthLiteral, (Expression) decompositionLiteral);
             Expression collationKeyFunction = new CollationKeyFunction(expressions);
             ImmutableBytesWritable ptr = new ImmutableBytesWritable();
             boolean ret = collationKeyFunction.evaluate(null, ptr);
             if (ret) {
    -            PhoenixArray result = (PhoenixArray) collationKeyFunction.getDataType().toObject(ptr,
    +            byte[] result = (byte[]) collationKeyFunction.getDataType().toObject(ptr,
                         collationKeyFunction.getSortOrder());
    +            byte[] expectedCollationKeyByteArray = Hex.decodeHex(expectedCollationKeyBytesHex.toCharArray());
    +
    --- End diff --

    Good point. Will do.

> Allow sorting on (Java) collation keys for non-English locales
> --------------------------------------------------------------
>
>                 Key: PHOENIX-4237
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4237
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Shehzaad Nakhoda
>             Fix For: 4.12.0
>
>
> Strings stored via Phoenix can be composed from a subset of the entire set of
> Unicode characters. The natural sort order for strings for different
> languages often differs from the order dictated by the binary representation
> of the characters of these strings. Java provides the idea of a Collator
> which given an input string and a (language) locale can generate a Collation
> Key which can then be used to compare strings in that natural order.
> Salesforce has recently open-sourced grammaticus. IBM has open-sourced ICU4J
> some time ago. These technologies can be combined to provide a robust new
> Phoenix function that can be used in an ORDER BY clause to sort strings
> according to the user's locale.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
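The issue description quoted above rests on one idea: a Collator turns a string into a collation key, and comparing keys gives the locale's natural order rather than the binary order of the characters. The sketch below illustrates this with the JDK's java.text.Collator for German, where a word starting with an umlaut sorts before "Zebra" even though its first character has a higher code point. The class and method names (CollationKeySortDemo, sortByCollationKey, compareUnsigned) are hypothetical, and the JDK collator is an assumption standing in for the grammaticus/ICU4J collators the issue proposes.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo, not code from the patch.
final class CollationKeySortDemo {

    // Sort strings by their collation keys for the given locale.
    static List<String> sortByCollationKey(List<String> input, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<CollationKey> keys = new ArrayList<>();
        for (String s : input) {
            keys.add(collator.getCollationKey(s));
        }
        Collections.sort(keys); // CollationKey is Comparable
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    // Lexicographic comparison of byte arrays, treating each byte as unsigned.
    // This is the kind of comparison a binary-sorted column would apply to stored keys.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Zebra", "\u00c4pfel");

        // Binary (code point) order puts "Zebra" first, because 'Z' < U+00C4.
        List<String> binary = new ArrayList<>(words);
        Collections.sort(binary);
        System.out.println(binary);

        // German collation order puts the umlaut word first.
        System.out.println(sortByCollationKey(words, Locale.GERMAN));
    }
}
```

The byte form of the key is what makes the approach usable in an ORDER BY: once the key bytes are produced, an ordinary unsigned byte comparison reproduces the collator's ordering, so the store itself needs no knowledge of the locale.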
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208285#comment-16208285 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user JamesRTaylor commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r145247898

    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java ---
    @@ -0,0 +1,221 @@
    +package org.apache.phoenix.expression.function;
    +
    +import java.sql.SQLException;
    +import java.text.Collator;
    +import java.util.Arrays;
    +import java.util.List;
    +import java.util.Locale;
    +
    +import org.apache.commons.lang.BooleanUtils;
    +import org.apache.commons.logging.Log;
    +import org.apache.commons.logging.LogFactory;
    +import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    +import org.apache.phoenix.expression.Expression;
    +import org.apache.phoenix.parse.FunctionParseNode;
    +import org.apache.phoenix.schema.tuple.Tuple;
    +import org.apache.phoenix.schema.types.PBoolean;
    +import org.apache.phoenix.schema.types.PDataType;
    +import org.apache.phoenix.schema.types.PInteger;
    +import org.apache.phoenix.schema.types.PIntegerArray;
    +import org.apache.phoenix.schema.types.PUnsignedIntArray;
    +import org.apache.phoenix.schema.types.PVarbinary;
    +import org.apache.phoenix.schema.types.PVarchar;
    +import org.apache.phoenix.schema.types.PhoenixArray;
    +import org.apache.phoenix.util.VarBinaryFormatter;
    +
    +import com.force.db.i18n.LinguisticSort;
    +import com.force.i18n.LocaleUtils;
    +
    +import com.ibm.icu.impl.jdkadapter.CollatorICU;
    +import com.ibm.icu.util.ULocale;
    +
    +/**
    + * A Phoenix Function that calculates a collation key for an input string based
    + * on a caller-provided locale and collator strength and decomposition settings.
    + *
    + * It uses the open-source grammaticus and i18n packages to obtain the collators
    + * it needs.
    + *
    + * @author snakhoda
    + *
    + */
    +@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
    +        // input string
    +        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
    +        // ISO Code for Locale
    +        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
    +        // whether to use special upper case collator
    +        @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true),
    +        // collator strength
    +        @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true),
    +        // collator decomposition
    +        @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) })
    +public class CollationKeyFunction extends ScalarFunction {
    +
    +    private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class);
    +
    +    public static final String NAME = "COLLKEY";
    --- End diff --

    Yes, that's fine. Let's use COLLATION_KEY as the built-in function name.
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208283#comment-16208283 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user JamesRTaylor commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r145247513

    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java ---
    @@ -0,0 +1,221 @@
    +package org.apache.phoenix.expression.function;
    +
    +import java.sql.SQLException;
    +import java.text.Collator;
    +import java.util.Arrays;
    +import java.util.List;
    +import java.util.Locale;
    +
    +import org.apache.commons.lang.BooleanUtils;
    +import org.apache.commons.logging.Log;
    +import org.apache.commons.logging.LogFactory;
    +import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    +import org.apache.phoenix.expression.Expression;
    +import org.apache.phoenix.parse.FunctionParseNode;
    +import org.apache.phoenix.schema.tuple.Tuple;
    +import org.apache.phoenix.schema.types.PBoolean;
    +import org.apache.phoenix.schema.types.PDataType;
    +import org.apache.phoenix.schema.types.PInteger;
    +import org.apache.phoenix.schema.types.PIntegerArray;
    +import org.apache.phoenix.schema.types.PUnsignedIntArray;
    +import org.apache.phoenix.schema.types.PVarbinary;
    +import org.apache.phoenix.schema.types.PVarchar;
    +import org.apache.phoenix.schema.types.PhoenixArray;
    +import org.apache.phoenix.util.VarBinaryFormatter;
    +
    +import com.force.db.i18n.LinguisticSort;
    +import com.force.i18n.LocaleUtils;
    +
    +import com.ibm.icu.impl.jdkadapter.CollatorICU;
    +import com.ibm.icu.util.ULocale;
    +
    +/**
    + * A Phoenix Function that calculates a collation key for an input string based
    + * on a caller-provided locale and collator strength and decomposition settings.
    + *
    + * It uses the open-source grammaticus and i18n packages to obtain the collators
    + * it needs.
    + *
    + * @author snakhoda
    + *
    + */
    +@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
    +        // input string
    +        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
    +        // ISO Code for Locale
    +        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
    +        // whether to use special upper case collator
    +        @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true),
    +        // collator strength
    +        @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true),
    +        // collator decomposition
    +        @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) })
    +public class CollationKeyFunction extends ScalarFunction {
    +
    +    private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class);
    +
    +    public static final String NAME = "COLLKEY";
    +
    +    public CollationKeyFunction() {
    +    }
    +
    +    public CollationKeyFunction(List<Expression> children) throws SQLException {
    +        super(children);
    +    }
    +
    +    @Override
    +    public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) {
    +        try {
    +            String inputValue = getInputValue(tuple, ptr);
    --- End diff --

    You can indicate that a function is not thread safe. I'll give you an easy way to do that and let you know what you need to do. In the meantime, if you could do the above, that'd be good.
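The exchange above turns on whether a collator held by the function can be used safely from several threads. The thread does not show the Phoenix mechanism for marking a function as not thread safe, so the sketch below is only a generic illustration of one alternative: give each thread its own clone of a configured collator. The class name PerThreadCollator and its methods are hypothetical, and the JDK collator is used as a stand-in for the one returned by LinguisticSort.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical sketch, not the approach taken in the patch.
final class PerThreadCollator {

    private final ThreadLocal<Collator> perThread;

    PerThreadCollator(Collator prototype) {
        // Take a private copy so later changes to the caller's collator do not leak in.
        final Collator template = (Collator) prototype.clone();
        // Each thread lazily receives its own clone on first use.
        this.perThread = ThreadLocal.withInitial(() -> (Collator) template.clone());
    }

    byte[] keyBytes(String input) {
        return perThread.get().getCollationKey(input).toByteArray();
    }

    public static void main(String[] args) throws InterruptedException {
        PerThreadCollator shared = new PerThreadCollator(Collator.getInstance(Locale.GERMAN));

        byte[] onMain = shared.keyBytes("Zebra");
        byte[][] onWorker = new byte[1][];
        Thread worker = new Thread(() -> onWorker[0] = shared.keyBytes("Zebra"));
        worker.start();
        worker.join();

        // Both threads derive the same key from independent collator instances.
        System.out.println(Arrays.equals(onMain, onWorker[0]));
    }
}
```

The trade-off is one collator instance per thread in exchange for lock-free evaluation; whether that is preferable to declaring the function not thread safe depends on how the engine shares expression instances, which this thread leaves open.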
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208271#comment-16208271 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user snakhoda-sfdc commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r145245002

    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java ---
    @@ -0,0 +1,221 @@
    [...]
    +    @Override
    +    public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) {
    +        try {
    +            String inputValue = getInputValue(tuple, ptr);
    --- End diff --

    @JamesRTaylor Won't that require that the collator be thread-safe? Or will the CollationKeyFunction not be shared across threads? (Maybe the tweak you were mentioning is for this purpose?)
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208269#comment-16208269 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user snakhoda-sfdc commented on the issue:

    https://github.com/apache/phoenix/pull/275

    @JamesRTaylor thanks for the feedback and support!

    So we have the i18n-util jar on maven now, but not the icu4j jars. Once the icu4j jars are published to maven, i18n-util will have to change to upgrade its dependency to the new version. I'm hoping that change will be in next week.

    Once that happens, I was thinking of creating a new PR that removes the outside code and introduces the external dependency.
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208261#comment-16208261 ] ASF GitHub Bot commented on PHOENIX-4237: - Github user snakhoda-sfdc commented on a diff in the pull request: https://github.com/apache/phoenix/pull/275#discussion_r145243623 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java --- @@ -0,0 +1,221 @@ +package org.apache.phoenix.expression.function; + +import java.sql.SQLException; +import java.text.Collator; +import java.util.Arrays; +import java.util.List; +import java.util.Locale; + +import org.apache.commons.lang.BooleanUtils; +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.hbase.io.ImmutableBytesWritable; +import org.apache.phoenix.expression.Expression; +import org.apache.phoenix.parse.FunctionParseNode; +import org.apache.phoenix.schema.tuple.Tuple; +import org.apache.phoenix.schema.types.PBoolean; +import org.apache.phoenix.schema.types.PDataType; +import org.apache.phoenix.schema.types.PInteger; +import org.apache.phoenix.schema.types.PIntegerArray; +import org.apache.phoenix.schema.types.PUnsignedIntArray; +import org.apache.phoenix.schema.types.PVarbinary; +import org.apache.phoenix.schema.types.PVarchar; +import org.apache.phoenix.schema.types.PhoenixArray; +import org.apache.phoenix.util.VarBinaryFormatter; + +import com.force.db.i18n.LinguisticSort; +import com.force.i18n.LocaleUtils; + +import com.ibm.icu.impl.jdkadapter.CollatorICU; +import com.ibm.icu.util.ULocale; + +/** + * A Phoenix Function that calculates a collation key for an input string based + * on a caller-provided locale and collator strength and decomposition settings. + * + * It uses the open-source grammaticus and i18n packages to obtain the collators + * it needs. 
+ * + * @author snakhoda + * + */ +@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = { + // input string + @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }), + // ISO Code for Locale + @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true), + // whether to use special upper case collator + @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true), + // collator strength + @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true), + // collator decomposition + @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) }) +public class CollationKeyFunction extends ScalarFunction { + + private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class); + + public static final String NAME = "COLLKEY"; + + public CollationKeyFunction() { + } + + public CollationKeyFunction(List children) throws SQLException { + super(children); + } + + @Override + public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) { + try { + String inputValue = getInputValue(tuple, ptr); + String localeISOCode = getLocaleISOCode(tuple, ptr); + Boolean useSpecialUpperCaseCollator = getUseSpecialUpperCaseCollator(tuple, ptr); + Integer collatorStrength = getCollatorStrength(tuple, ptr); + Integer collatorDecomposition = getCollatorDecomposition(tuple, ptr); + + Locale locale = LocaleUtils.get().getLocaleByIsoCode(localeISOCode); + + if(LOG.isDebugEnabled()) { + LOG.debug(String.format("Locale: " + locale.toLanguageTag())); + } + + LinguisticSort linguisticSort = LinguisticSort.get(locale); + + Collator collator = BooleanUtils.isTrue(useSpecialUpperCaseCollator) + ? 
linguisticSort.getUpperCaseCollator(false) : linguisticSort.getCollator(); + + if (collatorStrength != null) { + collator.setStrength(collatorStrength); + } + + if (collatorDecomposition != null) { + collator.setDecomposition(collatorDecomposition); + } + + if(LOG.isDebugEnabled()) { +
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208259#comment-16208259 ] ASF GitHub Bot commented on PHOENIX-4237: - Github user snakhoda-sfdc commented on a diff in the pull request: https://github.com/apache/phoenix/pull/275#discussion_r145243141 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java --- @@ -0,0 +1,221 @@ +package org.apache.phoenix.expression.function; + +import java.sql.SQLException; +import java.text.Collator; +import java.util.Arrays; +import java.util.List; +import java.util.Locale; + +import org.apache.commons.lang.BooleanUtils; +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.hbase.io.ImmutableBytesWritable; +import org.apache.phoenix.expression.Expression; +import org.apache.phoenix.parse.FunctionParseNode; +import org.apache.phoenix.schema.tuple.Tuple; +import org.apache.phoenix.schema.types.PBoolean; +import org.apache.phoenix.schema.types.PDataType; +import org.apache.phoenix.schema.types.PInteger; +import org.apache.phoenix.schema.types.PIntegerArray; +import org.apache.phoenix.schema.types.PUnsignedIntArray; +import org.apache.phoenix.schema.types.PVarbinary; +import org.apache.phoenix.schema.types.PVarchar; +import org.apache.phoenix.schema.types.PhoenixArray; +import org.apache.phoenix.util.VarBinaryFormatter; + +import com.force.db.i18n.LinguisticSort; +import com.force.i18n.LocaleUtils; + +import com.ibm.icu.impl.jdkadapter.CollatorICU; +import com.ibm.icu.util.ULocale; + +/** + * A Phoenix Function that calculates a collation key for an input string based + * on a caller-provided locale and collator strength and decomposition settings. + * + * It uses the open-source grammaticus and i18n packages to obtain the collators + * it needs. 
+ *
+ * @author snakhoda
+ *
+ */
+@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
+        // input string
+        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
+        // ISO Code for Locale
+        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
+        // whether to use special upper case collator
+        @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true),
+        // collator strength
+        @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true),
+        // collator decomposition
+        @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) })
+public class CollationKeyFunction extends ScalarFunction {
+
+    private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class);
+
+    public static final String NAME = "COLLKEY";
--- End diff --

There's no convention as such. Oracle functions are nlssort/ nls_upper, etc. We can call it COLLATION_KEY here. I'd rather have the name be more descriptive than less. Does that work?

> Allow sorting on (Java) collation keys for non-English locales
> --------------------------------------------------------------
>
>                 Key: PHOENIX-4237
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4237
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Shehzaad Nakhoda
>             Fix For: 4.12.0
>
>
> Strings stored via Phoenix can be composed from a subset of the entire set of
> Unicode characters. The natural sort order for strings for different
> languages often differs from the order dictated by the binary representation
> of the characters of these strings. Java provides the idea of a Collator
> which given an input string and a (language) locale can generate a Collation
> Key which can then be used to compare strings in that natural order.
> Salesforce has recently open-sourced grammaticus. IBM has open-sourced ICU4J
> some time ago. These technologies can be combined to provide a robust new
> Phoenix function that can be used in an ORDER BY clause to sort strings
> according to the user's locale.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208215#comment-16208215 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user JamesRTaylor commented on the issue:

https://github.com/apache/phoenix/pull/275

This is looking very good, @shehzaadn - thanks for the revisions. Couple more comments, but it's getting pretty close IMHO. How is the publishing to maven of the dependent jars looking?

> Allow sorting on (Java) collation keys for non-English locales
> --------------------------------------------------------------
>
>                 Key: PHOENIX-4237
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4237
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Shehzaad Nakhoda
>             Fix For: 4.12.0
>
>
> Strings stored via Phoenix can be composed from a subset of the entire set of
> Unicode characters. The natural sort order for strings for different
> languages often differs from the order dictated by the binary representation
> of the characters of these strings. Java provides the idea of a Collator
> which given an input string and a (language) locale can generate a Collation
> Key which can then be used to compare strings in that natural order.
> Salesforce has recently open-sourced grammaticus. IBM has open-sourced ICU4J
> some time ago. These technologies can be combined to provide a robust new
> Phoenix function that can be used in an ORDER BY clause to sort strings
> according to the user's locale.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208212#comment-16208212 ] ASF GitHub Bot commented on PHOENIX-4237: - Github user JamesRTaylor commented on a diff in the pull request: https://github.com/apache/phoenix/pull/275#discussion_r145233589 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java --- @@ -0,0 +1,221 @@ +package org.apache.phoenix.expression.function; + +import java.sql.SQLException; +import java.text.Collator; +import java.util.Arrays; +import java.util.List; +import java.util.Locale; + +import org.apache.commons.lang.BooleanUtils; +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.hbase.io.ImmutableBytesWritable; +import org.apache.phoenix.expression.Expression; +import org.apache.phoenix.parse.FunctionParseNode; +import org.apache.phoenix.schema.tuple.Tuple; +import org.apache.phoenix.schema.types.PBoolean; +import org.apache.phoenix.schema.types.PDataType; +import org.apache.phoenix.schema.types.PInteger; +import org.apache.phoenix.schema.types.PIntegerArray; +import org.apache.phoenix.schema.types.PUnsignedIntArray; +import org.apache.phoenix.schema.types.PVarbinary; +import org.apache.phoenix.schema.types.PVarchar; +import org.apache.phoenix.schema.types.PhoenixArray; +import org.apache.phoenix.util.VarBinaryFormatter; + +import com.force.db.i18n.LinguisticSort; +import com.force.i18n.LocaleUtils; + +import com.ibm.icu.impl.jdkadapter.CollatorICU; +import com.ibm.icu.util.ULocale; + +/** + * A Phoenix Function that calculates a collation key for an input string based + * on a caller-provided locale and collator strength and decomposition settings. + * + * It uses the open-source grammaticus and i18n packages to obtain the collators + * it needs. 
+ *
+ * @author snakhoda
+ *
+ */
+@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
+        // input string
+        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
+        // ISO Code for Locale
+        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
+        // whether to use special upper case collator
+        @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true),
+        // collator strength
+        @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true),
+        // collator decomposition
+        @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) })
+public class CollationKeyFunction extends ScalarFunction {
+
+    private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class);
+
+    public static final String NAME = "COLLKEY";
--- End diff --

Is there a convention in other RDBMS for the name of this function? Is it spelled out COLLATION_KEY or abbreviated as you've done? If abbreviated, then IMHO, it'd be better to name the class and unit tests CollKeyFunction, CollKeyFunctionIT, etc. to make it easier to find (i.e. based on the function name). That's our typical convention.

> Allow sorting on (Java) collation keys for non-English locales
> --------------------------------------------------------------
>
>                 Key: PHOENIX-4237
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4237
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Shehzaad Nakhoda
>             Fix For: 4.12.0
>
>
> Strings stored via Phoenix can be composed from a subset of the entire set of
> Unicode characters. The natural sort order for strings for different
> languages often differs from the order dictated by the binary representation
> of the characters of these strings. Java provides the idea of a Collator
> which given an input string and a (language) locale can generate a Collation
> Key which can then be used to compare strings in that natural order.
> Salesforce has recently open-sourced grammaticus. IBM has open-sourced ICU4J
> some time ago. These technologies can be combined to provide a robust new
> Phoenix function that can be used in an ORDER BY clause to sort strings
> according to the user's locale.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208207#comment-16208207 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145232581

--- Diff: phoenix-core/src/main/java/org/apache/phoenix/jdbc/PhoenixConnection.java ---
@@ -336,6 +338,7 @@ public ReadOnlyProps getProps() {
         formatters.put(PUnsignedTimestamp.INSTANCE, timestampFormat);
         formatters.put(PDecimal.INSTANCE, FunctionArgumentType.NUMERIC.getFormatter(numberPattern));
+        formatters.put(PVarbinary.INSTANCE, VarBinaryFormatter.INSTANCE);
--- End diff --

+1. Nice idea!

> Allow sorting on (Java) collation keys for non-English locales
> --------------------------------------------------------------
>
>                 Key: PHOENIX-4237
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4237
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Shehzaad Nakhoda
>             Fix For: 4.12.0
>
>
> Strings stored via Phoenix can be composed from a subset of the entire set of
> Unicode characters. The natural sort order for strings for different
> languages often differs from the order dictated by the binary representation
> of the characters of these strings. Java provides the idea of a Collator
> which given an input string and a (language) locale can generate a Collation
> Key which can then be used to compare strings in that natural order.
> Salesforce has recently open-sourced grammaticus. IBM has open-sourced ICU4J
> some time ago. These technologies can be combined to provide a robust new
> Phoenix function that can be used in an ORDER BY clause to sort strings
> according to the user's locale.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208205#comment-16208205 ] ASF GitHub Bot commented on PHOENIX-4237: - Github user JamesRTaylor commented on a diff in the pull request: https://github.com/apache/phoenix/pull/275#discussion_r145232171 --- Diff: phoenix-core/src/test/java/org/apache/phoenix/expression/function/CollationKeyFunctionTest.java --- @@ -96,33 +96,35 @@ private static boolean testExpression(String inputStr, String localeIsoCode, Sor strengthLiteral = LiteralExpression.newConstant(null, PInteger.INSTANCE, sortOrder); decompositionLiteral = LiteralExpression.newConstant(null, PInteger.INSTANCE, sortOrder); boolean ret = testExpression(inputStrLiteral, localeIsoCodeLiteral, upperCaseBooleanLiteral, strengthLiteral, - decompositionLiteral, new PhoenixArray(PInteger.INSTANCE, expectedCollationKeyBytes)); + decompositionLiteral, expectedCollationKeyBytesHex); return ret; } private static boolean testExpression(LiteralExpression inputStrLiteral, LiteralExpression localeIsoCodeLiteral, LiteralExpression upperCaseBooleanLiteral, LiteralExpression strengthLiteral, - LiteralExpression decompositionLiteral, PhoenixArray expectedCollationKeyByteArray) throws SQLException { + LiteralExpression decompositionLiteral, String expectedCollationKeyBytesHex) throws Exception { List expressions = Lists.newArrayList((Expression) inputStrLiteral, (Expression) localeIsoCodeLiteral, (Expression) upperCaseBooleanLiteral, (Expression) strengthLiteral, (Expression) decompositionLiteral); Expression collationKeyFunction = new CollationKeyFunction(expressions); ImmutableBytesWritable ptr = new ImmutableBytesWritable(); boolean ret = collationKeyFunction.evaluate(null, ptr); if (ret) { - PhoenixArray result = (PhoenixArray) collationKeyFunction.getDataType().toObject(ptr, + byte[] result = (byte[]) collationKeyFunction.getDataType().toObject(ptr, collationKeyFunction.getSortOrder()); + byte[] 
expectedCollationKeyByteArray = Hex.decodeHex(expectedCollationKeyBytesHex.toCharArray());
+
--- End diff --

Why not use assertArrayEquals here instead?

> Allow sorting on (Java) collation keys for non-English locales
> --------------------------------------------------------------
>
>                 Key: PHOENIX-4237
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4237
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Shehzaad Nakhoda
>             Fix For: 4.12.0
>
>
> Strings stored via Phoenix can be composed from a subset of the entire set of
> Unicode characters. The natural sort order for strings for different
> languages often differs from the order dictated by the binary representation
> of the characters of these strings. Java provides the idea of a Collator
> which given an input string and a (language) locale can generate a Collation
> Key which can then be used to compare strings in that natural order.
> Salesforce has recently open-sourced grammaticus. IBM has open-sourced ICU4J
> some time ago. These technologies can be combined to provide a robust new
> Phoenix function that can be used in an ORDER BY clause to sort strings
> according to the user's locale.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
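The review question above comes down to how Java compares arrays. JUnit 4's Assert.assertArrayEquals compares element by element, which a plain equals() call on an array does not. The sketch below shows the difference using only the standard library; the class name ByteArrayCompareSketch and the hand-written decodeHex helper are assumptions for this example (the test under review uses a Hex.decodeHex utility instead).

```java
import java.util.Arrays;

public class ByteArrayCompareSketch {

    /** Decodes a hex string such as "00ff10" into bytes; a stdlib-only stand-in for a hex codec. */
    static byte[] decodeHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] expected = decodeHex("00ff10");
        byte[] actual = { 0x00, (byte) 0xff, 0x10 };

        // equals() on arrays is reference equality, so two equal-content arrays differ.
        System.out.println("equals():        " + expected.equals(actual));

        // Arrays.equals compares length and contents, as assertArrayEquals does.
        System.out.println("Arrays.equals(): " + Arrays.equals(expected, actual));
    }
}
```

In a JUnit test, assertArrayEquals(expected, actual) also reports the first index at which the arrays differ, which is easier to debug than a bare boolean check.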
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208200#comment-16208200 ] ASF GitHub Bot commented on PHOENIX-4237: - Github user JamesRTaylor commented on a diff in the pull request: https://github.com/apache/phoenix/pull/275#discussion_r145230960 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java --- @@ -0,0 +1,221 @@ +package org.apache.phoenix.expression.function; + +import java.sql.SQLException; +import java.text.Collator; +import java.util.Arrays; +import java.util.List; +import java.util.Locale; + +import org.apache.commons.lang.BooleanUtils; +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.hbase.io.ImmutableBytesWritable; +import org.apache.phoenix.expression.Expression; +import org.apache.phoenix.parse.FunctionParseNode; +import org.apache.phoenix.schema.tuple.Tuple; +import org.apache.phoenix.schema.types.PBoolean; +import org.apache.phoenix.schema.types.PDataType; +import org.apache.phoenix.schema.types.PInteger; +import org.apache.phoenix.schema.types.PIntegerArray; +import org.apache.phoenix.schema.types.PUnsignedIntArray; +import org.apache.phoenix.schema.types.PVarbinary; +import org.apache.phoenix.schema.types.PVarchar; +import org.apache.phoenix.schema.types.PhoenixArray; +import org.apache.phoenix.util.VarBinaryFormatter; + +import com.force.db.i18n.LinguisticSort; +import com.force.i18n.LocaleUtils; + +import com.ibm.icu.impl.jdkadapter.CollatorICU; +import com.ibm.icu.util.ULocale; + +/** + * A Phoenix Function that calculates a collation key for an input string based + * on a caller-provided locale and collator strength and decomposition settings. + * + * It uses the open-source grammaticus and i18n packages to obtain the collators + * it needs. 
+ * + * @author snakhoda + * + */ +@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = { + // input string + @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }), + // ISO Code for Locale + @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true), + // whether to use special upper case collator + @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true), + // collator strength + @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true), + // collator decomposition + @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) }) +public class CollationKeyFunction extends ScalarFunction { + + private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class); + + public static final String NAME = "COLLKEY"; + + public CollationKeyFunction() { + } + + public CollationKeyFunction(List children) throws SQLException { + super(children); + } + + @Override + public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) { + try { + String inputValue = getInputValue(tuple, ptr); + String localeISOCode = getLocaleISOCode(tuple, ptr); + Boolean useSpecialUpperCaseCollator = getUseSpecialUpperCaseCollator(tuple, ptr); + Integer collatorStrength = getCollatorStrength(tuple, ptr); + Integer collatorDecomposition = getCollatorDecomposition(tuple, ptr); + + Locale locale = LocaleUtils.get().getLocaleByIsoCode(localeISOCode); + + if(LOG.isDebugEnabled()) { + LOG.debug(String.format("Locale: " + locale.toLanguageTag())); + } + + LinguisticSort linguisticSort = LinguisticSort.get(locale); + + Collator collator = BooleanUtils.isTrue(useSpecialUpperCaseCollator) + ? 
linguisticSort.getUpperCaseCollator(false) : linguisticSort.getCollator(); + + if (collatorStrength != null) { + collator.setStrength(collatorStrength); + } + + if (collatorDecomposition != null) { + collator.setDecomposition(collatorDecomposition); + } + + if(LOG.isDebugEnabled()) { +
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208197#comment-16208197 ] ASF GitHub Bot commented on PHOENIX-4237: - Github user JamesRTaylor commented on a diff in the pull request: https://github.com/apache/phoenix/pull/275#discussion_r145230698 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java --- @@ -0,0 +1,221 @@ +package org.apache.phoenix.expression.function; + +import java.sql.SQLException; +import java.text.Collator; +import java.util.Arrays; +import java.util.List; +import java.util.Locale; + +import org.apache.commons.lang.BooleanUtils; +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.hbase.io.ImmutableBytesWritable; +import org.apache.phoenix.expression.Expression; +import org.apache.phoenix.parse.FunctionParseNode; +import org.apache.phoenix.schema.tuple.Tuple; +import org.apache.phoenix.schema.types.PBoolean; +import org.apache.phoenix.schema.types.PDataType; +import org.apache.phoenix.schema.types.PInteger; +import org.apache.phoenix.schema.types.PIntegerArray; +import org.apache.phoenix.schema.types.PUnsignedIntArray; +import org.apache.phoenix.schema.types.PVarbinary; +import org.apache.phoenix.schema.types.PVarchar; +import org.apache.phoenix.schema.types.PhoenixArray; +import org.apache.phoenix.util.VarBinaryFormatter; + +import com.force.db.i18n.LinguisticSort; +import com.force.i18n.LocaleUtils; + +import com.ibm.icu.impl.jdkadapter.CollatorICU; +import com.ibm.icu.util.ULocale; + +/** + * A Phoenix Function that calculates a collation key for an input string based + * on a caller-provided locale and collator strength and decomposition settings. + * + * It uses the open-source grammaticus and i18n packages to obtain the collators + * it needs. 
+ *
+ * @author snakhoda
+ *
+ */
+@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
+        // input string
+        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
+        // ISO Code for Locale
+        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
+        // whether to use special upper case collator
+        @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true),
+        // collator strength
+        @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true),
+        // collator decomposition
+        @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) })
+public class CollationKeyFunction extends ScalarFunction {
+
+    private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class);
+
+    public static final String NAME = "COLLKEY";
+
+    public CollationKeyFunction() {
+    }
+
+    public CollationKeyFunction(List<Expression> children) throws SQLException {
+        super(children);
+    }
+
+    @Override
+    public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) {
+        try {
+            String inputValue = getInputValue(tuple, ptr);
--- End diff --

The evaluate method is called for every row during processing, so we want to have as little code here as possible. You can create a Collator local variable and move all the code that sets it up to an init() method. You'd call the init() method in the CollationKeyFunction(List<Expression> children) constructor and in an overridden readFields method like this (see InstrFunction for an example):

    @Override
    public void readFields(DataInput input) throws IOException {
        super.readFields(input);
        init();
    }

> Allow sorting on (Java) collation keys for non-English locales
> --------------------------------------------------------------
>
>                 Key: PHOENIX-4237
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4237
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Shehzaad Nakhoda
>             Fix For: 4.12.0
>
>
> Strings stored via Phoenix can be composed from a subset of the entire set of
> Unicode characters. The natural sort order for strings for different
> languages often differs from the order dictated by the binary representation
> of the characters of these strings. Java provides the idea of a Collator
> which given an input string and a (language) locale can generate a Collation
> Key which can then be used to compare strings in
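The init() suggestion in the review comment above can be sketched outside Phoenix. The following is a self-contained illustration under stated assumptions: CollationKeyExpr is a hypothetical stand-in and not a Phoenix class, its write/readFields methods only imitate the Writable-style serialization the comment refers to, and it uses the JDK Collator rather than the collators from the patch. The point it shows is that derived state (the collator) is rebuilt once in init(), called from both the argument constructor and readFields, so that the per-row method does no setup work.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class InitOnceSketch {

    /** Hypothetical stand-in for a serializable function expression; not a Phoenix class. */
    static class CollationKeyExpr {
        private String localeTag;   // serialized state
        private Integer strength;   // serialized state, may be null
        private Collator collator;  // derived state, rebuilt by init()

        /** No-arg constructor, used before readFields() fills in the state. */
        CollationKeyExpr() {
        }

        CollationKeyExpr(String localeTag, Integer strength) {
            this.localeTag = localeTag;
            this.strength = strength;
            init();
        }

        /** One-time setup: everything that does not depend on the row goes here. */
        private void init() {
            collator = Collator.getInstance(Locale.forLanguageTag(localeTag));
            if (strength != null) {
                collator.setStrength(strength);
            }
        }

        void write(DataOutput out) throws IOException {
            out.writeUTF(localeTag);
            out.writeBoolean(strength != null);
            if (strength != null) {
                out.writeInt(strength);
            }
        }

        void readFields(DataInput in) throws IOException {
            localeTag = in.readUTF();
            strength = in.readBoolean() ? Integer.valueOf(in.readInt()) : null;
            init(); // rebuild the collator on the receiving side
        }

        /** Per-row work only: no locale lookup, no collator construction. */
        byte[] evaluate(String input) {
            return collator.getCollationKey(input).toByteArray();
        }
    }

    /** Serializes an expression and reads it back into a fresh instance. */
    static CollationKeyExpr roundTrip(CollationKeyExpr expr) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            expr.write(new DataOutputStream(buffer));
            CollationKeyExpr copy = new CollationKeyExpr();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CollationKeyExpr original = new CollationKeyExpr("en", Collator.PRIMARY);
        CollationKeyExpr copy = roundTrip(original);
        System.out.println("same key after round trip: "
                + Arrays.equals(original.evaluate("abc"), copy.evaluate("abc")));
    }
}
```

Calling init() from readFields matters because the deserializing side constructs the object with the no-arg constructor; without it the collator field would still be null when the first row is evaluated.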
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208145#comment-16208145 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user snakhoda-sfdc commented on the issue:

https://github.com/apache/phoenix/pull/275

@JamesRTaylor Thanks for your comments. I added two further commits:

199c389: This addresses your comment about the byte array comparison. You were right! I must have got confused earlier with what was being displayed on sqlline.py not matching the sort order. I also added a formatter for PVarBinary because without it you simply get a Java hash code in sqlline.py which is hard to do anything with.

8cc2b5c: This adds the end-to-end tests you mentioned and also changes the unit test to use the hex representation of the byte array to make it easier to read.

> Allow sorting on (Java) collation keys for non-English locales
> --------------------------------------------------------------
>
>                 Key: PHOENIX-4237
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4237
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Shehzaad Nakhoda
>             Fix For: 4.12.0
>
>
> Strings stored via Phoenix can be composed from a subset of the entire set of
> Unicode characters. The natural sort order for strings for different
> languages often differs from the order dictated by the binary representation
> of the characters of these strings. Java provides the idea of a Collator
> which given an input string and a (language) locale can generate a Collation
> Key which can then be used to compare strings in that natural order.
> Salesforce has recently open-sourced grammaticus. IBM has open-sourced ICU4J
> some time ago. These technologies can be combined to provide a robust new
> Phoenix function that can be used in an ORDER BY clause to sort strings
> according to the user's locale.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
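The remark about seeing "a Java hash code" is easy to reproduce: calling toString() on a byte[] prints the array's type and identity hash, not its contents. The sketch below contrasts that with a hex rendering. It is only an illustration; the class name VarBinaryDisplaySketch and the toHex helper are made up here, and the thread does not show what output format the actual VarBinaryFormatter produces.

```java
public class VarBinaryDisplaySketch {

    /** Renders bytes as lower-case hex, two digits per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so negative bytes do not sign-extend.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] key = { 0x00, 0x2a, (byte) 0xff };

        // Default rendering: type tag plus identity hash, different on every run.
        System.out.println("toString(): " + key.toString());

        // Hex rendering: stable and comparable by eye.
        System.out.println("toHex():    " + toHex(key)); // prints "toHex():    002aff"
    }
}
```

A stable textual form like this is also what makes it practical to write expected collation keys as hex literals in unit tests, as the second commit in the comment describes.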
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203947#comment-16203947 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user JamesRTaylor commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r144620511

    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java ---
    @@ -0,0 +1,233 @@
    +package org.apache.phoenix.expression.function;
    +
    +import java.sql.SQLException;
    +import java.text.Collator;
    +import java.util.Arrays;
    +import java.util.List;
    +import java.util.Locale;
    +
    +import org.apache.commons.lang.BooleanUtils;
    +import org.apache.commons.logging.Log;
    +import org.apache.commons.logging.LogFactory;
    +import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    +import org.apache.phoenix.expression.Expression;
    +import org.apache.phoenix.parse.FunctionParseNode;
    +import org.apache.phoenix.schema.tuple.Tuple;
    +import org.apache.phoenix.schema.types.PBoolean;
    +import org.apache.phoenix.schema.types.PDataType;
    +import org.apache.phoenix.schema.types.PInteger;
    +import org.apache.phoenix.schema.types.PIntegerArray;
    +import org.apache.phoenix.schema.types.PUnsignedIntArray;
    +import org.apache.phoenix.schema.types.PVarbinary;
    +import org.apache.phoenix.schema.types.PVarchar;
    +import org.apache.phoenix.schema.types.PhoenixArray;
    +
    +import com.force.db.i18n.LinguisticSort;
    +import com.force.i18n.LocaleUtils;
    +
    +import com.ibm.icu.impl.jdkadapter.CollatorICU;
    +import com.ibm.icu.util.ULocale;
    +
    +/**
    + * A Phoenix Function that calculates a collation key for an input string based
    + * on a caller-provided locale and collator strength and decomposition settings.
    + *
    + * It uses the open-source grammaticus and i18n packages to obtain the collators
    + * it needs.
    + *
    + * @author snakhoda
    + *
    + */
    +@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
    +        // input string
    +        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
    +        // ISO Code for Locale
    +        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
    +        // whether to use special upper case collator
    +        @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true),
    +        // collator strength
    +        @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true),
    +        // collator decomposition
    +        @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) })
    +public class CollationKeyFunction extends ScalarFunction {
    +
    +    private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class);
    +
    +    public static final String NAME = "COLLKEY";
    +
    +    public CollationKeyFunction() {
    +    }
    +
    +    public CollationKeyFunction(List children) throws SQLException {
    +        super(children);
    +    }
    +
    +    @Override
    +    public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) {
    +        try {
    +            String inputValue = getInputValue(tuple, ptr);
    +            String localeISOCode = getLocaleISOCode(tuple, ptr);
    +            Boolean useSpecialUpperCaseCollator = getUseSpecialUpperCaseCollator(tuple, ptr);
    +            Integer collatorStrength = getCollatorStrength(tuple, ptr);
    +            Integer collatorDecomposition = getCollatorDecomposition(tuple, ptr);
    +
    +            Locale locale = LocaleUtils.get().getLocaleByIsoCode(localeISOCode);
    +
    +            if(LOG.isDebugEnabled()) {
    +                LOG.debug(String.format("Locale: " + locale.toLanguageTag()));
    +            }
    +
    +            LinguisticSort linguisticSort = LinguisticSort.get(locale);
    +
    +            Collator collator = BooleanUtils.isTrue(useSpecialUpperCaseCollator)
    +                    ? linguisticSort.getUpperCaseCollator(false) : linguisticSort.getCollator();
    +
    +            if (collatorStrength != null) {
    +                collator.setStrength(collatorStrength);
    +            }
    +
    +            if (collatorDecomposition != null) {
    +                collator.setDecomposition(collatorDecomposition);
    +            }
    +
    +            if(LOG.isDebugEnabled()) {
    +                LOG.debug(String.format("Collator: [strength:
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203839#comment-16203839 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user shehzaadn-vd commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r144604412

    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java ---
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203799#comment-16203799 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user JamesRTaylor commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r144600094

    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java ---
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203159#comment-16203159 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user snakhoda-sfdc commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r144483837

    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java ---
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202669#comment-16202669 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user JamesRTaylor commented on the issue:

    https://github.com/apache/phoenix/pull/275

    Thanks for the patch, @shehzaadn. This looks like a general enough built-in function to include in Phoenix IMHO. See inline for more specific comments.

    It'd be much better to include the first two commits as external dependencies. If we don't do that, we'll need to quickly follow up with replacing them with external dependencies (and make sure we don't change those files at all).

> Allow sorting on (Java) collation keys for non-English locales
> --------------------------------------------------------------
>
>                 Key: PHOENIX-4237
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4237
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Shehzaad Nakhoda
>             Fix For: 4.12.0
>
> Strings stored via Phoenix can be composed from a subset of the entire set of
> Unicode characters. The natural sort order for strings for different
> languages often differs from the order dictated by the binary representation
> of the characters of these strings. Java provides the idea of a Collator
> which given an input string and a (language) locale can generate a Collation
> Key which can then be used to compare strings in that natural order.
> Salesforce has recently open-sourced grammaticus. IBM has open-sourced ICU4J
> some time ago. These technologies can be combined to provide a robust new
> Phoenix function that can be used in an ORDER BY clause to sort strings
> according to the user's locale.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202667#comment-16202667 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user JamesRTaylor commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r144416251

    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java ---
    @@ -0,0 +1,233 @@
    +package org.apache.phoenix.expression.function;
    +
    +import java.sql.SQLException;
    +import java.text.Collator;
    +import java.util.Arrays;
    +import java.util.List;
    +import java.util.Locale;
    +
    +import org.apache.commons.lang.BooleanUtils;
    +import org.apache.commons.logging.Log;
    +import org.apache.commons.logging.LogFactory;
    +import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    +import org.apache.phoenix.expression.Expression;
    +import org.apache.phoenix.parse.FunctionParseNode;
    +import org.apache.phoenix.schema.tuple.Tuple;
    +import org.apache.phoenix.schema.types.PBoolean;
    +import org.apache.phoenix.schema.types.PDataType;
    +import org.apache.phoenix.schema.types.PInteger;
    +import org.apache.phoenix.schema.types.PIntegerArray;
    +import org.apache.phoenix.schema.types.PUnsignedIntArray;
    +import org.apache.phoenix.schema.types.PVarbinary;
    +import org.apache.phoenix.schema.types.PVarchar;
    +import org.apache.phoenix.schema.types.PhoenixArray;
    +
    +import com.force.db.i18n.LinguisticSort;
    +import com.force.i18n.LocaleUtils;
    +
    +import com.ibm.icu.impl.jdkadapter.CollatorICU;
    +import com.ibm.icu.util.ULocale;
    +
    +/**
    + * A Phoenix Function that calculates a collation key for an input string based
    + * on a caller-provided locale and collator strength and decomposition settings.
    + *
    + * It uses the open-source grammaticus and i18n packages to obtain the collators
    + * it needs.
    --- End diff --

    We should include more comments here. In particular, what sort order will we get? Does this mimic some other database's behavior (e.g., Oracle)? Does it deviate from that at all? Does Oracle follow some standard that we could point to?

    Also, please make sure to budget time to update our online reference manual: https://phoenix.apache.org/language/functions.html. This lives in phoenix.csv in our SVN repo as described here: https://phoenix.apache.org/building_website.html
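To make the "collator strength" setting mentioned in the quoted Javadoc concrete, here is a minimal sketch. It uses only the JDK's `java.text.Collator` with an English locale; it does not use the grammaticus `LinguisticSort` collators from the patch, and the class and method names are hypothetical, so it says nothing about how COLLKEY itself orders strings or whether that matches Oracle.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical illustration class; not part of the Phoenix patch.
final class CollatorStrengthSketch {

    /** Compares two strings with a JDK English collator at the given strength. */
    static int compareAt(int strength, String left, String right) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator.compare(left, right);
    }

    public static void main(String[] args) {
        // PRIMARY strength ignores case differences, so the two spellings tie.
        System.out.println(compareAt(Collator.PRIMARY, "phoenix", "PHOENIX") == 0);
        // TERTIARY strength takes case into account, so the result is non-zero.
        System.out.println(compareAt(Collator.TERTIARY, "phoenix", "PHOENIX") != 0);
    }
}
```

The strength therefore decides which differences (base letters, accents, case) influence the resulting order, which is one of the things the function documentation would need to spell out.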
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202662#comment-16202662 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user JamesRTaylor commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r144415724

    --- Diff: phoenix-core/src/test/java/org/apache/phoenix/expression/function/CollationKeyFunctionTest.java ---
    @@ -0,0 +1,134 @@
    +package org.apache.phoenix.expression.function;
    +
    +import static org.junit.Assert.assertEquals;
    +import static org.junit.Assert.fail;
    +
    +import java.sql.SQLException;
    +import java.util.List;
    +
    +import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    +import org.apache.phoenix.expression.function.CollationKeyFunction;
    +import org.apache.phoenix.schema.SortOrder;
    +import org.apache.phoenix.schema.types.PBoolean;
    +import org.apache.phoenix.schema.types.PInteger;
    +import org.apache.phoenix.schema.types.PVarchar;
    +import org.apache.phoenix.schema.types.PhoenixArray;
    +
    +import org.apache.phoenix.expression.Expression;
    +import org.apache.phoenix.expression.LiteralExpression;
    +
    +import org.junit.Test;
    +
    +import com.google.common.collect.Lists;
    +
    +/**
    + * "Unit" tests for CollationKeyFunction
    + *
    + * @author snakhoda
    + *
    + */
    +public class CollationKeyFunctionTest {
    --- End diff --

    We'll need more tests. You really want to test that the sort order of a list of strings matches the expected linguistic sort order. These tests don't have a lot of meaning in terms of validating that the sort order is correct IMHO. We'll also want end2end tests that use the new function.
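The kind of check requested above (sort a list of strings and compare the result with the expected linguistic order) could look roughly like the following sketch. It is an assumption-laden illustration: it uses the plain JDK `Collator` for English rather than COLLKEY or the grammaticus collators, and the class and method names are hypothetical. A real Phoenix test would evaluate the function itself and use locales where the linguistic order is less obvious.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of the Phoenix patch.
final class LinguisticOrderCheckSketch {

    /** Returns a copy of the input sorted with the JDK collator for the locale. */
    static List<String> sortWithCollator(List<String> input, Locale locale) {
        List<String> sorted = new ArrayList<>(input);
        sorted.sort(Collator.getInstance(locale));
        return sorted;
    }

    /** Returns a copy of the input sorted by raw UTF-16 code unit order. */
    static List<String> sortBinary(List<String> input) {
        List<String> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("cherry", "Banana", "apple");
        List<String> expectedLinguistic = Arrays.asList("apple", "Banana", "cherry");

        // The collator orders by letter first, so case does not push "Banana" ahead.
        if (!expectedLinguistic.equals(sortWithCollator(input, Locale.ENGLISH))) {
            throw new AssertionError("unexpected linguistic order");
        }
        // Binary order places the uppercase 'B' before every lowercase letter.
        System.out.println(sortBinary(input));
    }
}
```

Asserting on the whole sorted list, rather than on individual key bytes, is what ties the test to the behavior a user of ORDER BY would actually observe.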
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202655#comment-16202655 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r144414821

    --- Diff: phoenix-core/src/main/java/com/force/db/i18n/OracleUpper.java ---
    @@ -0,0 +1,66 @@
    +/*
    --- End diff --

    Yup! You got it right, James. Whether we include the code in binary form or source form, for BSD, we treat them the same (propagate in LICENSE, and copyright/etc in NOTICE). If there's a license header for the file, we would also leave that, IIRC.
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202651#comment-16202651 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user JamesRTaylor commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r144414623

    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java ---
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202645#comment-16202645 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

Github user JamesRTaylor commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/275#discussion_r144413717

    --- Diff: phoenix-core/src/main/java/com/force/db/i18n/OracleUpper.java ---
    @@ -0,0 +1,66 @@
    +/*
    --- End diff --

    @joshelser - my take, based on this[1], is that it's ok to include source code in an ASF project with a BSD license (as opposed to only having BSD licensed software as an external dependency). WDYT?

    [1] http://apache.org/licenses/#code-developed-elsewhere-received-under-a-category-a-license-incorporated-into-apache-projects-distributed-by-apache-and-licensed-to-downstream-users-under-its-original-license
[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales
[ https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192500#comment-16192500 ]

ASF GitHub Bot commented on PHOENIX-4237:
-----------------------------------------

GitHub user shehzaadn opened a pull request:

    https://github.com/apache/phoenix/pull/275

    PHOENIX-4237: Add function to calculate Java collation keys

    Here we implement a generalized solution for calculating Java collation keys by creating Java collators based on a user locale. These collation keys can then be used in an ORDER BY clause to sort strings in a natural-language-appropriate way.

    We add a new Phoenix function, COLLKEY. In general, usage of this function will be:

        select name from my_table order by COLLKEY(name, 'zh_TW')

    We use artifacts from the ICU4J project and the recently open-sourced grammaticus project (by Maven dependency). We were forced to include some code from ICU4J because some jars produced by that project aren't published in Maven. We also include code from Salesforce that has been licensed for open-source release but not yet published as artifacts in Maven.

    There are three commits that split the changes into three logical pieces:

    1) f8cb121: Add the external source code described above
    2) fdbb5e0: Make changes needed to the Phoenix license due to the above (and fix what seems to be an existing bug)
    3) 98cfc10: The actual function implementation of COLLKEY - new code that uses the code introduced above and newly introduced dependencies via Maven.

    Thanks in advance to the Phoenix community for your feedback on this.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shehzaadn/phoenix master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/phoenix/pull/275.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #275

----
commit f8cb121145163591345eea70acbc313098e23e21
Author: Shehzaad
Date:   2017-09-30T01:52:46Z

    (1) add ICU4J source code for charset/localespi jars and (2) add Salesforce i18n-util source code

commit fdbb5e009a767e0f6df385dc9a1a8472b32cc361
Author: Shehzaad
Date:   2017-10-02T17:55:39Z

    (1) Fix text of 3-clause BSD License, (2) add Unicode license, (3) add mention of bundling ICU4J and i18n-util code

commit 98cfc10bac3c48ec3e7ceb47bea0b60556265c85
Author: Shehzaad
Date:   2017-10-02T21:58:31Z

    add function COLLKEY to Phoenix to calculate a Java collation key on a given string with the collator derived from an ISO locale code and some other parameters
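The idea behind ORDER BY COLLKEY(...) described in the pull request, sorting on the bytes of a collation key instead of on the string itself, can be sketched in plain Java as follows. This is only an illustration under stated assumptions: it uses the JDK's `Collator` for English rather than the grammaticus/ICU4J collators from the patch, the class and helper names are hypothetical, and the unsigned byte comparison merely stands in for how a binary sort key would be ordered.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of the Phoenix patch.
final class CollationKeySortSketch {

    /** Computes the collation key bytes for a value under the locale's JDK collator. */
    static byte[] collationKeyBytes(String value, Locale locale) {
        return Collator.getInstance(locale).getCollationKey(value).toByteArray();
    }

    /** Lexicographic comparison of byte arrays, treating each byte as unsigned. */
    static int compareUnsigned(byte[] left, byte[] right) {
        int common = Math.min(left.length, right.length);
        for (int i = 0; i < common; i++) {
            int diff = (left[i] & 0xff) - (right[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return left.length - right.length;
    }

    /** Sorts strings by their collation key bytes rather than by the strings themselves. */
    static List<String> sortByKeyBytes(List<String> input, Locale locale) {
        List<String> sorted = new ArrayList<>(input);
        sorted.sort((a, b) -> compareUnsigned(
                collationKeyBytes(a, locale), collationKeyBytes(b, locale)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("cherry", "Banana", "apple");
        // Ordering by key bytes reproduces the collator's order for these inputs.
        System.out.println(sortByKeyBytes(names, Locale.ENGLISH));
    }
}
```

The sketch relies on the documented property of `CollationKey.toByteArray()` that comparing the byte arrays of two comparable keys yields the same result as comparing the keys, which is what makes a binary-sorted key usable as a stand-in for linguistic comparison.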