[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-11-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238173#comment-16238173
 ] 

Hudson commented on PHOENIX-4237:
-

SUCCESS: Integrated in Jenkins build Phoenix-master #1865 (See [https://builds.apache.org/job/Phoenix-master/1865/])
PHOENIX-4237 Allow sorting on (Java) collation keys for non-English (jtaylor: rev ee4355791acf3f31568fcd8c43367947d25a1386)
* (add) phoenix-core/src/it/java/org/apache/phoenix/end2end/CollationKeyFunctionIT.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/expression/ExpressionType.java
* (add) phoenix-core/src/test/java/org/apache/phoenix/expression/function/CollationKeyFunctionTest.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/jdbc/PhoenixConnection.java
* (add) phoenix-core/src/main/java/org/apache/phoenix/util/VarBinaryFormatter.java
* (edit) LICENSE
* (edit) phoenix-server/pom.xml
* (add) phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
* (edit) phoenix-core/pom.xml


> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Shehzaad Nakhoda
> Assignee: Shehzaad Nakhoda
> Priority: Major
> Fix For: 4.13.0
>
> Attachments: PHOENIX-4237_v1.patch, PHOENIX-4237_v2.patch, 
> PHOENIX-4237_v3.patch
>
>
> Strings stored via Phoenix can be composed from a subset of the entire set of 
> Unicode characters. The natural sort order of strings in a given language 
> often differs from the order dictated by the binary representation of their 
> characters. Java provides a Collator which, given an input string and a 
> (language) locale, can generate a collation key that can then be used to 
> compare strings in that natural order.
> Salesforce has recently open-sourced grammaticus, and IBM open-sourced ICU4J 
> some time ago. These technologies can be combined to provide a robust new 
> Phoenix function that can be used in an ORDER BY clause to sort strings 
> according to the user's locale.
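
The collation-key idea in the description above can be illustrated with a minimal sketch. It assumes only the JDK's own java.text.Collator, not the grammaticus/ICU4J combination the issue proposes, and the class and method names are hypothetical. It is not Phoenix's COLLATION_KEY implementation; it only contrasts binary order with collation-key order.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Phoenix.
final class CollationKeySketch {

    /** Sorts by String.compareTo, i.e. by the "binary" UTF-16 code unit order. */
    static List<String> sortBinary(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }

    /** Sorts by the collation keys a JDK Collator generates for the given locale. */
    static List<String> sortByCollationKey(List<String> input, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        // Decompose accented characters so that accents only act as a secondary difference.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

        List<CollationKey> keys = new ArrayList<>();
        for (String s : input) {
            keys.add(collator.getCollationKey(s));
        }
        // CollationKey is Comparable; comparing keys is a comparison of their key bytes.
        Collections.sort(keys);

        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Zebra", "apple", "\u00e9clair");
        System.out.println("binary:    " + sortBinary(words));
        System.out.println("collation: " + sortByCollationKey(words, Locale.ENGLISH));
    }
}
```

In binary order the upper-case "Zebra" precedes "apple", and the accented word sorts last; with English collation keys the order follows the alphabet regardless of case or accent.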



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-11-03 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237834#comment-16237834
 ] 

James Taylor commented on PHOENIX-4237:
---

+1. Great work, [~shehzaadn]!

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Shehzaad Nakhoda
>Assignee: Shehzaad Nakhoda
>Priority: Major
> Fix For: 4.12.0
>
> Attachments: PHOENIX-4237_v1.patch, PHOENIX-4237_v2.patch, 
> PHOENIX-4237_v3.patch
>
>


[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-11-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237360#comment-16237360
 ] 

Hadoop QA commented on PHOENIX-4237:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12895584/PHOENIX-4237_v3.patch
  against master branch at commit 1e48eabe4cbf72ce71fb0dbdd6053a9600133ee4.
  ATTACHMENT ID: 12895584

{color:red}-1 @author{color}.  The patch appears to contain 1 @author tags 
which the Hadoop community has agreed to not allow in code contributions.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified tests.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+   queryWithCollKeyDefaultArgsWithExpectedOrder("zh_TW", 0, 6, new Integer[] { 0, 3, 4, 1, 5, 2, 6 });
+   queryWithCollKeyDefaultArgsWithExpectedOrder("zh_TW_STROKE", 0, 6, new Integer[] { 4, 2, 0, 3, 1, 6, 5 });
+   queryWithCollKeyDefaultArgsWithExpectedOrder("zh__STROKE", 0, 6, new Integer[] { 0, 1, 3, 4, 6, 2, 5 });
+   queryWithCollKeyDefaultArgsWithExpectedOrder("zh__PINYIN", 0, 6, new Integer[] { 0, 1, 3, 4, 6, 2, 5 });
+   queryWithCollKeyUpperCaseWithExpectedOrder("en", 7, 13, new Integer[] { 7, 10, 11, 13, 9, 12, 8 });
+   private void queryWithCollKeyDefaultArgsWithExpectedOrder(String localeString, Integer beginIndex, Integer endIndex,
+   "SELECT id, data FROM %s WHERE ID BETWEEN %d AND %d ORDER BY COLLATION_KEY(data, '%s')", tableName,
+   private void queryWithCollKeyUpperCaseWithExpectedOrder(String localeString, Integer beginIndex, Integer endIndex,
+   "SELECT id, data FROM %s WHERE ID BETWEEN %d AND %d ORDER BY COLLATION_KEY(data, '%s', true), id",
+   private void queryWithCollKeyWithStrengthWithExpectedOrder(String localeString, Integer strength, boolean isDescending,
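
As a sketch of how a query like the ones flagged above could be built while staying under the 100-column limit, the helper below splits the format string across lines. The class and method names are hypothetical and the patch's actual helper bodies are not shown in this report; only the SQL text is taken from the flagged lines.

```java
import java.util.Locale;

// Hypothetical illustration class; not the helper from the patch.
final class CollationKeyQuerySketch {

    /**
     * Builds a query that orders rows by COLLATION_KEY(data, locale).
     * The locale is interpolated as a string literal, as in the flagged test lines.
     */
    static String orderByCollationKey(String tableName, int beginId, int endId,
            String localeString) {
        return String.format(Locale.ROOT,
                "SELECT id, data FROM %s WHERE ID BETWEEN %d AND %d"
                        + " ORDER BY COLLATION_KEY(data, '%s')",
                tableName, beginId, endId, localeString);
    }

    public static void main(String[] args) {
        // "T" is a placeholder table name.
        System.out.println(orderByCollationKey("T", 0, 6, "zh_TW"));
    }
}
```

Passing Locale.ROOT keeps the %d placeholders formatted as plain ASCII digits whatever the JVM's default locale is.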

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexFailureIT

Test results: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1614//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1614//console

This message is automatically generated.

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Shehzaad Nakhoda
>Assignee: Shehzaad Nakhoda
>Priority: Major
> Fix For: 4.12.0
>
> Attachments: PHOENIX-4237_v1.patch, PHOENIX-4237_v2.patch, 
> PHOENIX-4237_v3.patch
>
>


[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213562#comment-16213562
 ] 

Hadoop QA commented on PHOENIX-4237:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12893345/PHOENIX-4237_v2.patch
  against master branch at commit 7cdcb2313b08d2eaeb775f0c989642f8d416cfb6.
  ATTACHMENT ID: 12893345

{color:red}-1 @author{color}.  The patch appears to contain 18 @author tags 
which the Hadoop community has agreed to not allow in code contributions.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified tests.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 release audit{color}.  The applied patch generated 60 release 
audit warnings (more than the master's current 0 warnings).

{color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+   queryWithCollKeyDefaultArgsWithExpectedOrder("zh_TW", 0, 6, new Integer[] { 0, 3, 4, 1, 5, 2, 6 });
+   queryWithCollKeyDefaultArgsWithExpectedOrder("zh_TW_STROKE", 0, 6, new Integer[] { 4, 2, 0, 3, 1, 6, 5 });
+   queryWithCollKeyDefaultArgsWithExpectedOrder("zh__STROKE", 0, 6, new Integer[] { 0, 1, 3, 4, 6, 2, 5 });
+   queryWithCollKeyDefaultArgsWithExpectedOrder("zh__PINYIN", 0, 6, new Integer[] { 0, 1, 3, 4, 6, 2, 5 });
+   queryWithCollKeyUpperCaseWithExpectedOrder("en", 7, 13, new Integer[] { 7, 10, 11, 13, 9, 12, 8 });
+   private void queryWithCollKeyDefaultArgsWithExpectedOrder(String localeString, Integer beginIndex, Integer endIndex,
+   "SELECT id, data FROM %s WHERE ID BETWEEN %d AND %d ORDER BY COLLATION_KEY(data, '%s')", tableName,
+   private void queryWithCollKeyUpperCaseWithExpectedOrder(String localeString, Integer beginIndex, Integer endIndex,
+   "SELECT id, data FROM %s WHERE ID BETWEEN %d AND %d ORDER BY COLLATION_KEY(data, '%s', true), id",
+   private void queryWithCollKeyWithStrengthWithExpectedOrder(String localeString, Integer strength, boolean isDescending,
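
The last helper above takes a strength argument. As a hedged illustration of what a collation strength does, the sketch below uses the JDK's java.text.Collator levels. Whether Phoenix's strength argument maps onto these same levels is an assumption that this report does not confirm, and the class and method names are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical illustration class; not part of Phoenix.
final class CollationStrengthSketch {

    /** Compares two strings through their English collation keys at the given strength. */
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        // Decompose accented characters so accents count as a secondary difference.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        collator.setStrength(strength);
        return collator.getCollationKey(a).compareTo(collator.getCollationKey(b));
    }

    public static void main(String[] args) {
        // PRIMARY ignores both case and accents.
        System.out.println(compareAt(Collator.PRIMARY, "apple", "APPLE") == 0);
        System.out.println(compareAt(Collator.PRIMARY, "resume", "r\u00e9sum\u00e9") == 0);
        // SECONDARY distinguishes accents; TERTIARY also distinguishes case.
        System.out.println(compareAt(Collator.SECONDARY, "resume", "r\u00e9sum\u00e9") != 0);
        System.out.println(compareAt(Collator.TERTIARY, "apple", "APPLE") != 0);
    }
}
```

A weaker strength makes more strings share a collation key, so an ORDER BY on such keys typically needs a tie-breaker column, as the `, id` in the upper-case query above suggests.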

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ReadIsolationLevelIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SetPropertyOnEncodedTableIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ConcurrentMutationsIT

Test results: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1565//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1565//artifact/patchprocess/patchReleaseAuditWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1565//console

This message is automatically generated.

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Shehzaad Nakhoda
>Assignee: Shehzaad Nakhoda
> Fix For: 4.12.0
>
> Attachments: PHOENIX-4237_v1.patch, PHOENIX-4237_v2.patch
>
>


[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213121#comment-16213121
 ] 

Hadoop QA commented on PHOENIX-4237:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12893307/PHOENIX-4237_v1.patch
  against master branch at commit 7cdcb2313b08d2eaeb775f0c989642f8d416cfb6.
  ATTACHMENT ID: 12893307

{color:red}-1 @author{color}.  The patch appears to contain 17 @author tags 
which the Hadoop community has agreed to not allow in code contributions.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified tests.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+   queryWithCollKeyDefaultArgsWithExpectedOrder("zh_TW", 0, 6, new Integer[] { 0, 3, 4, 1, 5, 2, 6 });
+   queryWithCollKeyDefaultArgsWithExpectedOrder("zh_TW_STROKE", 0, 6, new Integer[] { 4, 2, 0, 3, 1, 6, 5 });
+   queryWithCollKeyDefaultArgsWithExpectedOrder("zh__STROKE", 0, 6, new Integer[] { 0, 1, 3, 4, 6, 2, 5 });
+   queryWithCollKeyDefaultArgsWithExpectedOrder("zh__PINYIN", 0, 6, new Integer[] { 0, 1, 3, 4, 6, 2, 5 });
+   queryWithCollKeyUpperCaseWithExpectedOrder("en", 7, 13, new Integer[] { 7, 10, 11, 13, 9, 12, 8 });
+   private void queryWithCollKeyDefaultArgsWithExpectedOrder(String localeString, Integer beginIndex, Integer endIndex,
+   "SELECT id, data FROM %s WHERE ID BETWEEN %d AND %d ORDER BY COLLATION_KEY(data, '%s')", tableName,
+   private void queryWithCollKeyUpperCaseWithExpectedOrder(String localeString, Integer beginIndex, Integer endIndex,
+   "SELECT id, data FROM %s WHERE ID BETWEEN %d AND %d ORDER BY COLLATION_KEY(data, '%s', true), id",
+   private void queryWithCollKeyWithStrengthWithExpectedOrder(String localeString, Integer strength, boolean isDescending,

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   
org.apache.phoenix.expression.function.CollationKeyFunctionTest

Test results: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1563//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1563//console

This message is automatically generated.

> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Shehzaad Nakhoda
>Assignee: Shehzaad Nakhoda
> Fix For: 4.12.0
>
> Attachments: PHOENIX-4237_v1.patch
>
>


[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212015#comment-16212015
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user snakhoda-sfdc commented on the issue:

https://github.com/apache/phoenix/pull/275
  
@JamesRTaylor I'm not sure how to do that within this PR. Looking at 
https://github.com/blog/2141-squash-your-commits, I believe that when you 
merge the PR, GitHub should give you the option to squash all commits into one. 
Will that suffice?




> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
>
>


[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212002#comment-16212002
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on the issue:

https://github.com/apache/phoenix/pull/275
  
Would you mind squashing all the commits into a single commit, @shehzaadn? 
Then I'll get this committed.


> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
>
>


[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211765#comment-16211765
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on the issue:

https://github.com/apache/phoenix/pull/275
  
+1. Nice work, @shehzaadn! 


> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
>
>


[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211282#comment-16211282
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user snakhoda-sfdc commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145747600
  
--- Diff: 
phoenix-core/src/main/java/com/ibm/icu/impl/jdkadapter/NumberFormatICU.java ---
@@ -0,0 +1,229 @@
+// © 2016 and later: Unicode, Inc. and others.
+// License & terms of use: http://www.unicode.org/copyright.html#License
+/*
+ ***
+ * Copyright (C) 2008, International Business Machines Corporation and *
+ * others. All Rights Reserved. *
+ ***
+ */
+package com.ibm.icu.impl.jdkadapter;
+
+import java.math.RoundingMode;
+import java.text.FieldPosition;
+import java.text.ParseException;
+import java.text.ParsePosition;
+import java.util.Currency;
+
+import com.ibm.icu.impl.icuadapter.NumberFormatJDK;
+import com.ibm.icu.text.NumberFormat;
+
+/**
+ * NumberFormatICU is an adapter class which wraps ICU4J NumberFormat and
+ * implements java.text.NumberFormat APIs.
+ */
+public class NumberFormatICU extends java.text.NumberFormat {
+
+    private static final long serialVersionUID = 4892903815641574060L;
+
+    private NumberFormat fIcuNfmt;
+
+    private NumberFormatICU(NumberFormat icuNfmt) {
+        fIcuNfmt = icuNfmt;
+    }
+
+    public static java.text.NumberFormat wrap(NumberFormat icuNfmt) {
+        if (icuNfmt instanceof NumberFormatJDK) {
+            return ((NumberFormatJDK)icuNfmt).unwrap();
+        }
+        return new NumberFormatICU(icuNfmt);
+    }
+
+    public NumberFormat unwrap() {
+        return fIcuNfmt;
+    }
+
+    @Override
+    public Object clone() {
+        NumberFormatICU other = (NumberFormatICU)super.clone();
+        other.fIcuNfmt = (NumberFormat)fIcuNfmt.clone();
+        return other;
+    }
+
+    @Override
+    public boolean equals(Object obj) {
+        if (obj instanceof NumberFormatICU) {
+            return ((NumberFormatICU)obj).fIcuNfmt.equals(fIcuNfmt);
+        }
+        return false;
+    }
+
+    //public String format(double number)
--- End diff --

Thanks for taking a look at this PR, @solzy. This code is external and was 
simply copied over from ICU4J 59.1. It is here only because that project 
doesn't have all of its artifacts in Maven. I'm hoping to open a new PR in 
the near future to remove this external code and replace it with Maven 
dependencies. CC: @JamesRTaylor 


> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
>
>


[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210750#comment-16210750
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user solzy commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145639624
  
--- Diff: 
phoenix-core/src/main/java/com/ibm/icu/impl/jdkadapter/NumberFormatICU.java ---
@@ -0,0 +1,229 @@
+// © 2016 and later: Unicode, Inc. and others.
+// License & terms of use: http://www.unicode.org/copyright.html#License
+/*
+ ***
+ * Copyright (C) 2008, International Business Machines Corporation and *
+ * others. All Rights Reserved. *
+ ***
+ */
+package com.ibm.icu.impl.jdkadapter;
+
+import java.math.RoundingMode;
+import java.text.FieldPosition;
+import java.text.ParseException;
+import java.text.ParsePosition;
+import java.util.Currency;
+
+import com.ibm.icu.impl.icuadapter.NumberFormatJDK;
+import com.ibm.icu.text.NumberFormat;
+
+/**
+ * NumberFormatICU is an adapter class which wraps ICU4J NumberFormat and
+ * implements java.text.NumberFormat APIs.
+ */
+public class NumberFormatICU extends java.text.NumberFormat {
+
+    private static final long serialVersionUID = 4892903815641574060L;
+
+    private NumberFormat fIcuNfmt;
+
+    private NumberFormatICU(NumberFormat icuNfmt) {
+        fIcuNfmt = icuNfmt;
+    }
+
+    public static java.text.NumberFormat wrap(NumberFormat icuNfmt) {
+        if (icuNfmt instanceof NumberFormatJDK) {
+            return ((NumberFormatJDK)icuNfmt).unwrap();
+        }
+        return new NumberFormatICU(icuNfmt);
+    }
+
+    public NumberFormat unwrap() {
+        return fIcuNfmt;
+    }
+
+    @Override
+    public Object clone() {
+        NumberFormatICU other = (NumberFormatICU)super.clone();
+        other.fIcuNfmt = (NumberFormat)fIcuNfmt.clone();
+        return other;
+    }
+
+    @Override
+    public boolean equals(Object obj) {
+        if (obj instanceof NumberFormatICU) {
+            return ((NumberFormatICU)obj).fIcuNfmt.equals(fIcuNfmt);
+        }
+        return false;
+    }
+
+    //public String format(double number)
--- End diff --

Delete this unused line; keep it clean!


> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
>
>


[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210656#comment-16210656
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user snakhoda-sfdc commented on the issue:

https://github.com/apache/phoenix/pull/275
  
@JamesRTaylor I've addressed the last round of comments in this commit 
(9d6d4f7). Thanks.


> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
>
>


[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209676#comment-16209676
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on the issue:

https://github.com/apache/phoenix/pull/275
  
Looking very good. A couple of minor nits, and the testing needs to be 
rounded out just a bit.


> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
>
>


[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209673#comment-16209673
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145475423
  
--- Diff: 
phoenix-core/src/test/java/org/apache/phoenix/expression/function/CollationKeyFunctionTest.java
 ---
@@ -0,0 +1,143 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.phoenix.expression.function;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertArrayEquals;
+import static org.junit.Assert.fail;
+
+import java.util.List;
+
+import org.apache.commons.codec.binary.Hex;
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.phoenix.expression.function.CollationKeyFunction;
+import org.apache.phoenix.schema.SortOrder;
+import org.apache.phoenix.schema.types.PBoolean;
+import org.apache.phoenix.schema.types.PInteger;
+import org.apache.phoenix.schema.types.PVarchar;
+
+import org.apache.phoenix.expression.Expression;
+import org.apache.phoenix.expression.LiteralExpression;
+
+import org.junit.Test;
+
+import com.google.common.collect.Lists;
+
+/**
+ * "Unit" tests for CollationKeyFunction
+ * 
+ * @author snakhoda-sfdc
+ *
+ */
+public class CollationKeyFunctionTest {
+
+   @Test
+   public void testChineseCollationKeyBytes() throws Exception {
+   
+   // Chinese (China)
+   test("\u963f", "zh", "02eb0001");
+   test("\u55c4", "zh", "14ad0001");
+   test("\u963e", "zh", "8000963f00010001");
+   test("\u554a", "zh", "02ea0001");
+   test("\u4ec8", "zh", "80004ec900010001");
+   test("\u3d9a", "zh", "80003d9b00010001");
+   test("\u9f51", "zh", "19050001");
+   
+   // Chinese (Taiwan)
+   test("\u963f", "zh_TW", "063d0001");
+   test("\u55c4", "zh_TW", "241e0001");
+   test("\u963e", "zh_TW", "8000963f00010001");
+   test("\u554a", "zh_TW", "09c90001");
+   test("\u4ec8", "zh_TW", "181b0001");
+   test("\u3d9a", "zh_TW", "80003d9b00010001");
+   test("\u9f51", "zh_TW", "80009f5200010001");
+   
+   // Chinese (Taiwan, Stroke)
+   test("\u963f", "zh_TW_STROKE", "5450010500");
+   test("\u55c4", "zh_TW_STROKE", "7334010500");
+   test("\u963e", "zh_TW_STROKE", "544f010500");
+   test("\u554a", "zh_TW_STROKE", "62de010500");
+   test("\u4ec8", "zh_TW_STROKE", "46be010500");
+   test("\u3d9a", "zh_TW_STROKE", "a50392010500");
+   test("\u9f51", "zh_TW_STROKE", "8915010500");
+   
+   // Chinese (China, Stroke)
+   test("\u963f", "zh__STROKE", "28010500");
+   test("\u55c4", "zh__STROKE", "2a010500");
+   test("\u963e", "zh__STROKE", "7575010500");
+   test("\u554a", "zh__STROKE", "2b010500");
+   test("\u4ec8", "zh__STROKE", "51a1010500");
+   test("\u3d9a", "zh__STROKE", "a50392010500");
+   test("\u9f51", "zh__STROKE", "6935010500");
+   
+   // Chinese (China, Pinyin)
+   test("\u963f", "zh__PINYIN", "28010500");
+   test("\u55c4", "zh__PINYIN", "2a010500");
+   test("\u963e", "zh__PINYIN", "7575010500");
+   test("\u554a", "zh__PINYIN", "2b010500");
+   test("\u4ec8", "zh__PINYIN", "51a1010500");
+   test("\u3d9a", "zh__PINYIN", "a50392010500");
+   test("\u9f51", "zh__PINYIN", "6935010500");
+   
+   }
+
+   private 

[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209670#comment-16209670
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145474350
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.phoenix.expression.function;
+
+import java.io.DataInput;
+import java.io.IOException;
+import java.sql.SQLException;
+import java.text.Collator;
+import java.util.List;
+import java.util.Locale;
+
+import org.apache.commons.lang.BooleanUtils;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.phoenix.expression.Expression;
+import org.apache.phoenix.expression.LiteralExpression;
+import org.apache.phoenix.parse.FunctionParseNode;
+import org.apache.phoenix.schema.tuple.Tuple;
+import org.apache.phoenix.schema.types.PBoolean;
+import org.apache.phoenix.schema.types.PDataType;
+import org.apache.phoenix.schema.types.PInteger;
+import org.apache.phoenix.schema.types.PVarbinary;
+import org.apache.phoenix.schema.types.PVarchar;
+import org.apache.phoenix.util.VarBinaryFormatter;
+
+import com.force.db.i18n.LinguisticSort;
+import com.force.i18n.LocaleUtils;
+
+/**
+ * A Phoenix Function that calculates a collation key for an input
+ * string based on a caller-provided locale and collator strength and
+ * decomposition settings.
+ * 
+ * The locale should be specified as xx_yy_variant where xx is the ISO
+ * 639-1 2-letter language code, yy is the ISO 3166 2-letter
+ * country code. Both countryCode and variant are optional. For
+ * example, zh_TW_STROKE, zh_TW and zh are all valid locale
+ * representations. Note the language code, country code and variant
+ * are used as arguments to the constructor of java.util.Locale.
+ *
+ * This function uses the open-source grammaticus and i18n-util
+ * packages to obtain the collators it needs from the provided locale.
+ *
+ * The LinguisticSort implementation in i18n-util encapsulates
+ * sort-related functionality for a substantive list of locales. For
+ * each locale, it provides a collator and an Oracle-specific database
+ * function that can be used to sort strings according to the natural
+ * language rules of that locale.
+ *
+ * This function uses the collator returned by
+ * LinguisticSort.getCollator to produce a collation key for its input
+ * string. A user can expect that the sorting semantics of this
+ * function for a given locale is equivalent to the sorting behaviour
+ * of an Oracle query that is constructed using the Oracle functions
+ * returned by LinguisticSort for that locale.
+ *
+ * The optional third argument to the function is a boolean that
+ * specifies whether to use the upper-case collator (case-insensitive)
+ * returned by LinguisticSort.getUpperCaseCollator.
+ *
+ * The optional fourth and fifth arguments are used to set
+ * respectively the strength and decomposition of the collator returned
+ * by LinguisticSort using the setStrength and setDecomposition
+ * methods of java.text.Collator.
+ * 
+ * @author snakhoda-sfdc
+ *
+ */
+@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
+   // input string
+   @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
+   // ISO Code for Locale
+   @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
+   // whether to use special upper case collator
+   
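
As a rough illustration of the Javadoc above, the sketch below parses a locale string of the form xx_yy_variant and applies optional strength and decomposition settings to a collator. It assumes the plain JDK java.util.Locale constructor and java.text.Collator rather than the LocaleUtils and LinguisticSort classes the patch imports. The class LocaleCollatorSketch and its methods are hypothetical names.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical illustration class; the patch itself relies on i18n-util.
public class LocaleCollatorSketch {

    // Parses "xx", "xx_yy" or "xx_yy_variant" (country may be empty, as in "zh__STROKE").
    public static Locale parseLocale(String isoCode) {
        String[] parts = isoCode.split("_", 3);
        String language = parts[0];
        String country = parts.length > 1 ? parts[1] : "";
        String variant = parts.length > 2 ? parts[2] : "";
        return new Locale(language, country, variant);
    }

    // Builds a JDK collator; a null strength or decomposition keeps the collator's default.
    public static Collator buildCollator(String isoCode, Integer strength, Integer decomposition) {
        Collator collator = Collator.getInstance(parseLocale(isoCode));
        if (strength != null) {
            collator.setStrength(strength);
        }
        if (decomposition != null) {
            collator.setDecomposition(decomposition);
        }
        return collator;
    }

    public static void main(String[] args) {
        Locale locale = parseLocale("zh_TW_STROKE");
        System.out.println(locale.getLanguage() + " / " + locale.getCountry() + " / " + locale.getVariant());
        // PRIMARY strength ignores case differences; TERTIARY does not.
        System.out.println(buildCollator("en", Collator.PRIMARY, null).compare("a", "A"));
        System.out.println(buildCollator("en", Collator.TERTIARY, null).compare("a", "A"));
    }
}
```

Passing null for the fourth and fifth arguments mirrors the "null" default values declared in the function annotation, so callers that only supply a locale get the collator's own defaults.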

[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209669#comment-16209669
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145473757
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.phoenix.expression.function;
+
+import java.io.DataInput;
+import java.io.IOException;
+import java.sql.SQLException;
+import java.text.Collator;
+import java.util.List;
+import java.util.Locale;
+
+import org.apache.commons.lang.BooleanUtils;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.phoenix.expression.Expression;
+import org.apache.phoenix.expression.LiteralExpression;
+import org.apache.phoenix.parse.FunctionParseNode;
+import org.apache.phoenix.schema.tuple.Tuple;
+import org.apache.phoenix.schema.types.PBoolean;
+import org.apache.phoenix.schema.types.PDataType;
+import org.apache.phoenix.schema.types.PInteger;
+import org.apache.phoenix.schema.types.PVarbinary;
+import org.apache.phoenix.schema.types.PVarchar;
+import org.apache.phoenix.util.VarBinaryFormatter;
+
+import com.force.db.i18n.LinguisticSort;
+import com.force.i18n.LocaleUtils;
+
+/**
+ * A Phoenix Function that calculates a collation key for an input
+ * string based on a caller-provided locale and collator strength and
+ * decomposition settings.
+ * 
+ * The locale should be specified as xx_yy_variant where xx is the ISO
+ * 639-1 2-letter language code, yy is the ISO 3166 2-letter
+ * country code. Both countryCode and variant are optional. For
+ * example, zh_TW_STROKE, zh_TW and zh are all valid locale
+ * representations. Note the language code, country code and variant
+ * are used as arguments to the constructor of java.util.Locale.
+ *
+ * This function uses the open-source grammaticus and i18n-util
+ * packages to obtain the collators it needs from the provided locale.
+ *
+ * The LinguisticSort implementation in i18n-util encapsulates
+ * sort-related functionality for a substantive list of locales. For
+ * each locale, it provides a collator and an Oracle-specific database
+ * function that can be used to sort strings according to the natural
+ * language rules of that locale.
+ *
+ * This function uses the collator returned by
+ * LinguisticSort.getCollator to produce a collation key for its input
+ * string. A user can expect that the sorting semantics of this
+ * function for a given locale is equivalent to the sorting behaviour
+ * of an Oracle query that is constructed using the Oracle functions
+ * returned by LinguisticSort for that locale.
+ *
+ * The optional third argument to the function is a boolean that
+ * specifies whether to use the upper-case collator (case-insensitive)
+ * returned by LinguisticSort.getUpperCaseCollator.
+ *
+ * The optional fourth and fifth arguments are used to set
+ * respectively the strength and decomposition of the collator returned
+ * by LinguisticSort using the setStrength and setDecomposition
+ * methods of java.text.Collator.
+ * 
+ * @author snakhoda-sfdc
+ *
+ */
+@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
+   // input string
+   @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
+   // ISO Code for Locale
+   @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
+   // whether to use special upper case collator
+   

[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209667#comment-16209667
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145473581
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.phoenix.expression.function;
+
+import java.io.DataInput;
+import java.io.IOException;
+import java.sql.SQLException;
+import java.text.Collator;
+import java.util.List;
+import java.util.Locale;
+
+import org.apache.commons.lang.BooleanUtils;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.phoenix.expression.Expression;
+import org.apache.phoenix.expression.LiteralExpression;
+import org.apache.phoenix.parse.FunctionParseNode;
+import org.apache.phoenix.schema.tuple.Tuple;
+import org.apache.phoenix.schema.types.PBoolean;
+import org.apache.phoenix.schema.types.PDataType;
+import org.apache.phoenix.schema.types.PInteger;
+import org.apache.phoenix.schema.types.PVarbinary;
+import org.apache.phoenix.schema.types.PVarchar;
+import org.apache.phoenix.util.VarBinaryFormatter;
+
+import com.force.db.i18n.LinguisticSort;
+import com.force.i18n.LocaleUtils;
+
+/**
+ * A Phoenix Function that calculates a collation key for an input
+ * string based on a caller-provided locale and collator strength and
+ * decomposition settings.
+ * 
+ * The locale should be specified as xx_yy_variant where xx is the ISO
+ * 639-1 2-letter language code, yy is the ISO 3166 2-letter
+ * country code. Both countryCode and variant are optional. For
+ * example, zh_TW_STROKE, zh_TW and zh are all valid locale
+ * representations. Note the language code, country code and variant
+ * are used as arguments to the constructor of java.util.Locale.
+ *
+ * This function uses the open-source grammaticus and i18n-util
+ * packages to obtain the collators it needs from the provided locale.
+ *
+ * The LinguisticSort implementation in i18n-util encapsulates
+ * sort-related functionality for a substantive list of locales. For
+ * each locale, it provides a collator and an Oracle-specific database
+ * function that can be used to sort strings according to the natural
+ * language rules of that locale.
+ *
+ * This function uses the collator returned by
+ * LinguisticSort.getCollator to produce a collation key for its input
+ * string. A user can expect that the sorting semantics of this
+ * function for a given locale is equivalent to the sorting behaviour
+ * of an Oracle query that is constructed using the Oracle functions
+ * returned by LinguisticSort for that locale.
+ *
+ * The optional third argument to the function is a boolean that
+ * specifies whether to use the upper-case collator (case-insensitive)
+ * returned by LinguisticSort.getUpperCaseCollator.
+ *
+ * The optional fourth and fifth arguments are used to set
+ * respectively the strength and decomposition of the collator returned
+ * by LinguisticSort using the setStrength and setDecomposition
+ * methods of java.text.Collator.
+ * 
+ * @author snakhoda-sfdc
+ *
+ */
+@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
+   // input string
+   @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
+   // ISO Code for Locale
+   @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
+   // whether to use special upper case collator
+   

[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208338#comment-16208338
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user snakhoda-sfdc commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145257151
  
--- Diff: 
phoenix-core/src/test/java/org/apache/phoenix/expression/function/CollationKeyFunctionTest.java
 ---
@@ -96,33 +96,35 @@ private static boolean testExpression(String inputStr, String localeIsoCode, Sor
strengthLiteral = LiteralExpression.newConstant(null, PInteger.INSTANCE, sortOrder);
decompositionLiteral = LiteralExpression.newConstant(null, PInteger.INSTANCE, sortOrder);
boolean ret = testExpression(inputStrLiteral, localeIsoCodeLiteral, upperCaseBooleanLiteral, strengthLiteral,
-   decompositionLiteral, new PhoenixArray(PInteger.INSTANCE, expectedCollationKeyBytes));
+   decompositionLiteral, expectedCollationKeyBytesHex);
return ret;
}
 
private static boolean testExpression(LiteralExpression inputStrLiteral, LiteralExpression localeIsoCodeLiteral,
LiteralExpression upperCaseBooleanLiteral, LiteralExpression strengthLiteral,
-   LiteralExpression decompositionLiteral, PhoenixArray expectedCollationKeyByteArray) throws SQLException {
+   LiteralExpression decompositionLiteral, String expectedCollationKeyBytesHex) throws Exception {
List<Expression> expressions = Lists.newArrayList((Expression) inputStrLiteral,
(Expression) localeIsoCodeLiteral, (Expression) upperCaseBooleanLiteral, (Expression) strengthLiteral,
(Expression) decompositionLiteral);
Expression collationKeyFunction = new CollationKeyFunction(expressions);
ImmutableBytesWritable ptr = new ImmutableBytesWritable();
boolean ret = collationKeyFunction.evaluate(null, ptr);
if (ret) {
-   PhoenixArray result = (PhoenixArray) collationKeyFunction.getDataType().toObject(ptr,
+   byte[] result = (byte[]) collationKeyFunction.getDataType().toObject(ptr,
collationKeyFunction.getSortOrder());
 
+   byte[] expectedCollationKeyByteArray = Hex.decodeHex(expectedCollationKeyBytesHex.toCharArray());
+   
--- End diff --

Good point. Will do.
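
The change in the diff above (expected collation keys passed as hex strings and compared as byte arrays) can be sketched without any test framework. This is a minimal sketch only: decodeHex below is a hand-written stand-in for the commons-codec Hex.decodeHex call used in the diff, and the class HexKeyCompareSketch and its methods are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration class; the real test uses commons-codec and JUnit.
public class HexKeyCompareSketch {

    // Decodes a hex string such as "02eb0001" into its bytes, most significant nibble first.
    public static byte[] decodeHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length: " + hex);
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex string: " + hex);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    // True when the actual collation key bytes equal the bytes encoded by expectedHex.
    public static boolean matchesExpected(byte[] actual, String expectedHex) {
        return Arrays.equals(actual, decodeHex(expectedHex));
    }

    public static void main(String[] args) {
        byte[] actual = { 0x02, (byte) 0xeb, 0x00, 0x01 };
        System.out.println(matchesExpected(actual, "02eb0001"));
    }
}
```

Keeping the expectation as a hex string makes each test case a single readable literal, which is the shape the test("\u963f", "zh", "02eb0001") calls in the later revision of the test take.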


> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
>
>
> Strings stored via Phoenix can be composed from a subset of the entire set of 
> Unicode characters. The natural sort order for strings for different 
> languages often differs from the order dictated by the binary representation 
> of the characters of these strings. Java provides the idea of a Collator 
> which given an input string and a (language) locale can generate a Collation 
> Key which can then be used to compare strings in that natural order.
> Salesforce has recently open-sourced grammaticus. IBM has open-sourced ICU4J 
> some time ago. These technologies can be combined to provide a robust new 
> Phoenix function that can be used in an ORDER BY clause to sort strings 
> according to the user's locale.





[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208285#comment-16208285
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145247898
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---
@@ -0,0 +1,221 @@
+package org.apache.phoenix.expression.function;
+
+import java.sql.SQLException;
+import java.text.Collator;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Locale;
+
+import org.apache.commons.lang.BooleanUtils;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.phoenix.expression.Expression;
+import org.apache.phoenix.parse.FunctionParseNode;
+import org.apache.phoenix.schema.tuple.Tuple;
+import org.apache.phoenix.schema.types.PBoolean;
+import org.apache.phoenix.schema.types.PDataType;
+import org.apache.phoenix.schema.types.PInteger;
+import org.apache.phoenix.schema.types.PIntegerArray;
+import org.apache.phoenix.schema.types.PUnsignedIntArray;
+import org.apache.phoenix.schema.types.PVarbinary;
+import org.apache.phoenix.schema.types.PVarchar;
+import org.apache.phoenix.schema.types.PhoenixArray;
+import org.apache.phoenix.util.VarBinaryFormatter;
+
+import com.force.db.i18n.LinguisticSort;
+import com.force.i18n.LocaleUtils;
+
+import com.ibm.icu.impl.jdkadapter.CollatorICU;
+import com.ibm.icu.util.ULocale;
+
+/**
+ * A Phoenix Function that calculates a collation key for an input string based
+ * on a caller-provided locale and collator strength and decomposition settings.
+ * 
+ * It uses the open-source grammaticus and i18n packages to obtain the collators
+ * it needs.
+ * 
+ * @author snakhoda
+ *
+ */
+@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
+   // input string
+   @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
+   // ISO Code for Locale
+   @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
+   // whether to use special upper case collator
+   @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true),
+   // collator strength
+   @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true),
+   // collator decomposition
+   @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) })
+public class CollationKeyFunction extends ScalarFunction {
+
+   private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class);
+
+   public static final String NAME = "COLLKEY";
--- End diff --

Yes, that's fine. Let's use COLLATION_KEY as the built-in function name.




[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208283#comment-16208283
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145247513
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---
@@ -0,0 +1,221 @@
+package org.apache.phoenix.expression.function;
+
+import java.sql.SQLException;
+import java.text.Collator;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Locale;
+
+import org.apache.commons.lang.BooleanUtils;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.phoenix.expression.Expression;
+import org.apache.phoenix.parse.FunctionParseNode;
+import org.apache.phoenix.schema.tuple.Tuple;
+import org.apache.phoenix.schema.types.PBoolean;
+import org.apache.phoenix.schema.types.PDataType;
+import org.apache.phoenix.schema.types.PInteger;
+import org.apache.phoenix.schema.types.PIntegerArray;
+import org.apache.phoenix.schema.types.PUnsignedIntArray;
+import org.apache.phoenix.schema.types.PVarbinary;
+import org.apache.phoenix.schema.types.PVarchar;
+import org.apache.phoenix.schema.types.PhoenixArray;
+import org.apache.phoenix.util.VarBinaryFormatter;
+
+import com.force.db.i18n.LinguisticSort;
+import com.force.i18n.LocaleUtils;
+
+import com.ibm.icu.impl.jdkadapter.CollatorICU;
+import com.ibm.icu.util.ULocale;
+
+/**
+ * A Phoenix Function that calculates a collation key for an input string based
+ * on a caller-provided locale and collator strength and decomposition settings.
+ * 
+ * It uses the open-source grammaticus and i18n packages to obtain the collators
+ * it needs.
+ * 
+ * @author snakhoda
+ *
+ */
+@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
+   // input string
+   @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
+   // ISO Code for Locale
+   @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
+   // whether to use special upper case collator
+   @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true),
+   // collator strength
+   @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true),
+   // collator decomposition
+   @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) })
+public class CollationKeyFunction extends ScalarFunction {
+
+   private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class);
+
+   public static final String NAME = "COLLKEY";
+
+   public CollationKeyFunction() {
+   }
+
+   public CollationKeyFunction(List<Expression> children) throws SQLException {
+   super(children);
+   }
+
+   @Override
+   public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) {
+   try {
+   String inputValue = getInputValue(tuple, ptr);
--- End diff --

You can indicate that a function is not thread safe. I'll give you an easy 
way to do that and let you know what you need to do. In the meantime, if you 
could do the above, that'd be good.




[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208271#comment-16208271
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user snakhoda-sfdc commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145245002
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---
@@ -0,0 +1,221 @@
+package org.apache.phoenix.expression.function;
+
+import java.sql.SQLException;
+import java.text.Collator;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Locale;
+
+import org.apache.commons.lang.BooleanUtils;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.phoenix.expression.Expression;
+import org.apache.phoenix.parse.FunctionParseNode;
+import org.apache.phoenix.schema.tuple.Tuple;
+import org.apache.phoenix.schema.types.PBoolean;
+import org.apache.phoenix.schema.types.PDataType;
+import org.apache.phoenix.schema.types.PInteger;
+import org.apache.phoenix.schema.types.PIntegerArray;
+import org.apache.phoenix.schema.types.PUnsignedIntArray;
+import org.apache.phoenix.schema.types.PVarbinary;
+import org.apache.phoenix.schema.types.PVarchar;
+import org.apache.phoenix.schema.types.PhoenixArray;
+import org.apache.phoenix.util.VarBinaryFormatter;
+
+import com.force.db.i18n.LinguisticSort;
+import com.force.i18n.LocaleUtils;
+
+import com.ibm.icu.impl.jdkadapter.CollatorICU;
+import com.ibm.icu.util.ULocale;
+
+/**
+ * A Phoenix Function that calculates a collation key for an input string based
+ * on a caller-provided locale and collator strength and decomposition settings.
+ * 
+ * It uses the open-source grammaticus and i18n packages to obtain the collators
+ * it needs.
+ * 
+ * @author snakhoda
+ *
+ */
+@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
+   // input string
+   @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
+   // ISO Code for Locale
+   @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
+   // whether to use special upper case collator
+   @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true),
+   // collator strength
+   @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true),
+   // collator decomposition
+   @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) })
+public class CollationKeyFunction extends ScalarFunction {
+
+   private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class);
+
+   public static final String NAME = "COLLKEY";
+
+   public CollationKeyFunction() {
+   }
+
+   public CollationKeyFunction(List<Expression> children) throws SQLException {
+   super(children);
+   }
+
+   @Override
+   public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) {
+   try {
+   String inputValue = getInputValue(tuple, ptr);
--- End diff --

@JamesRTaylor  Won't that require that the collator be thread-safe? Or will 
the CollationKeyFunction not be shared across threads? (Maybe the tweak you 
were mentioning is for this purpose?)
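
One common way to sidestep the question above, independent of whatever mechanism Phoenix offers for marking a function as not thread safe, is to give each thread its own collator. The sketch below assumes the JDK java.text.Collator. The class PerThreadCollatorSketch and its method are hypothetical names, and this is not the approach settled on in this thread.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration class; not part of the Phoenix patch.
public class PerThreadCollatorSketch {

    // Configured once and never mutated afterwards; used only as a template for clones.
    private final Collator prototype;

    // Each thread lazily receives its own clone, so no collator instance is shared.
    private final ThreadLocal<Collator> perThread;

    public PerThreadCollatorSketch(Locale locale) {
        this.prototype = Collator.getInstance(locale);
        this.perThread = ThreadLocal.withInitial(() -> (Collator) prototype.clone());
    }

    // Computes the collation key bytes for the input using the calling thread's collator.
    public byte[] collationKeyBytes(String input) {
        return perThread.get().getCollationKey(input).toByteArray();
    }

    public static void main(String[] args) {
        PerThreadCollatorSketch sketch = new PerThreadCollatorSketch(Locale.ENGLISH);
        byte[] first = sketch.collationKeyBytes("abc");
        byte[] second = sketch.collationKeyBytes("abc");
        System.out.println(Arrays.equals(first, second));
    }
}
```

The trade-off is one collator per thread instead of one per function instance. Caching a single collator in the function is cheaper, but only safe if the function object is never evaluated concurrently.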


> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
>
>
> Strings stored via Phoenix can be composed from a subset of the entire set of 
> Unicode characters. The natural sort order for strings for different 
> languages often differs from the order dictated by the binary representation 
> of the characters of these strings. Java provides the idea of a Collator 
> which given an input string and a (language) locale can generate a Collation 
> Key which can then be used to compare strings in that natural order.
> Salesforce has recently open-sourced grammaticus. IBM has open-sourced ICU4J 
> some time ago. These technologies can be combined to provide a robust new 
> Phoenix function that can be used in an ORDER BY clause to sort strings 
> according to the user's locale.
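
The idea in the description above can be sketched with the JDK alone. This is a minimal illustration, assuming java.text.Collator in place of the grammaticus/ICU4J collators the issue proposes; the class CollationKeyDemo and its helpers are hypothetical names. It computes a binary collation key per string and sorts by unsigned byte comparison of those keys, which is how a byte-ordered store would compare them.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;

// Hypothetical demo class; not part of Phoenix.
final class CollationKeyDemo {
    // Binary collation key for s under the given locale (JDK collator, default strength).
    static byte[] key(String s, Locale locale) {
        return Collator.getInstance(locale).getCollationKey(s).toByteArray();
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Returns a copy of input ordered by the bytes of each string's collation key.
    static String[] sortByKey(String[] input, Locale locale) {
        String[] out = input.clone();
        Arrays.sort(out, Comparator.comparing((String s) -> key(s, locale),
                CollationKeyDemo::compareUnsigned));
        return out;
    }

    public static void main(String[] args) {
        String[] words = { "zebra", "Apple", "apple", "Zulu" };
        System.out.println(Arrays.toString(sortByKey(words, Locale.ENGLISH)));
    }
}
```

Sorting the raw strings by their binary representation would place every upper-case word ahead of every lower-case one; ordering by key bytes instead follows the collator's rules for the locale.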



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208269#comment-16208269
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user snakhoda-sfdc commented on the issue:

https://github.com/apache/phoenix/pull/275
  
@JamesRTaylor thanks for the feedback and support! So we have the i18n-util 
jar on maven now, but not the icu4j jars. Once the icu4j jars are published to 
maven, i18n-util will have to change to upgrade its dependency to the new 
version. I'm hoping that change will be in next week.

Once that happens, I was thinking of creating a new PR that removes the 
outside code and introduces the external dependency.




[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208261#comment-16208261
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user snakhoda-sfdc commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145243623
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---
@@ -0,0 +1,221 @@
+package org.apache.phoenix.expression.function;
+
+import java.sql.SQLException;
+import java.text.Collator;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Locale;
+
+import org.apache.commons.lang.BooleanUtils;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.phoenix.expression.Expression;
+import org.apache.phoenix.parse.FunctionParseNode;
+import org.apache.phoenix.schema.tuple.Tuple;
+import org.apache.phoenix.schema.types.PBoolean;
+import org.apache.phoenix.schema.types.PDataType;
+import org.apache.phoenix.schema.types.PInteger;
+import org.apache.phoenix.schema.types.PIntegerArray;
+import org.apache.phoenix.schema.types.PUnsignedIntArray;
+import org.apache.phoenix.schema.types.PVarbinary;
+import org.apache.phoenix.schema.types.PVarchar;
+import org.apache.phoenix.schema.types.PhoenixArray;
+import org.apache.phoenix.util.VarBinaryFormatter;
+
+import com.force.db.i18n.LinguisticSort;
+import com.force.i18n.LocaleUtils;
+
+import com.ibm.icu.impl.jdkadapter.CollatorICU;
+import com.ibm.icu.util.ULocale;
+
+/**
+ * A Phoenix Function that calculates a collation key for an input string based
+ * on a caller-provided locale and collator strength and decomposition settings.
+ *
+ * It uses the open-source grammaticus and i18n packages to obtain the collators
+ * it needs.
+ *
+ * @author snakhoda
+ *
+ */
+@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
+    // input string
+    @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
+    // ISO Code for Locale
+    @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
+    // whether to use special upper case collator
+    @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true),
+    // collator strength
+    @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true),
+    // collator decomposition
+    @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) })
+public class CollationKeyFunction extends ScalarFunction {
+
+    private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class);
+
+    public static final String NAME = "COLLKEY";
+
+    public CollationKeyFunction() {
+    }
+
+    public CollationKeyFunction(List<Expression> children) throws SQLException {
+        super(children);
+    }
+
+    @Override
+    public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) {
+        try {
+            String inputValue = getInputValue(tuple, ptr);
+            String localeISOCode = getLocaleISOCode(tuple, ptr);
+            Boolean useSpecialUpperCaseCollator = getUseSpecialUpperCaseCollator(tuple, ptr);
+            Integer collatorStrength = getCollatorStrength(tuple, ptr);
+            Integer collatorDecomposition = getCollatorDecomposition(tuple, ptr);
+
+            Locale locale = LocaleUtils.get().getLocaleByIsoCode(localeISOCode);
+
+            if(LOG.isDebugEnabled()) {
+                LOG.debug(String.format("Locale: " + locale.toLanguageTag()));
+            }
+
+            LinguisticSort linguisticSort = LinguisticSort.get(locale);
+
+            Collator collator = BooleanUtils.isTrue(useSpecialUpperCaseCollator)
+                    ? linguisticSort.getUpperCaseCollator(false) : linguisticSort.getCollator();
+
+            if (collatorStrength != null) {
+                collator.setStrength(collatorStrength);
+            }
+
+            if (collatorDecomposition != null) {
+                collator.setDecomposition(collatorDecomposition);
+            }
+
+            if(LOG.isDebugEnabled()) {
+ 

[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208259#comment-16208259
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user snakhoda-sfdc commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145243141
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---
@@ -0,0 +1,221 @@
+package org.apache.phoenix.expression.function;
+
+import java.sql.SQLException;
+import java.text.Collator;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Locale;
+
+import org.apache.commons.lang.BooleanUtils;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.phoenix.expression.Expression;
+import org.apache.phoenix.parse.FunctionParseNode;
+import org.apache.phoenix.schema.tuple.Tuple;
+import org.apache.phoenix.schema.types.PBoolean;
+import org.apache.phoenix.schema.types.PDataType;
+import org.apache.phoenix.schema.types.PInteger;
+import org.apache.phoenix.schema.types.PIntegerArray;
+import org.apache.phoenix.schema.types.PUnsignedIntArray;
+import org.apache.phoenix.schema.types.PVarbinary;
+import org.apache.phoenix.schema.types.PVarchar;
+import org.apache.phoenix.schema.types.PhoenixArray;
+import org.apache.phoenix.util.VarBinaryFormatter;
+
+import com.force.db.i18n.LinguisticSort;
+import com.force.i18n.LocaleUtils;
+
+import com.ibm.icu.impl.jdkadapter.CollatorICU;
+import com.ibm.icu.util.ULocale;
+
+/**
+ * A Phoenix Function that calculates a collation key for an input string based
+ * on a caller-provided locale and collator strength and decomposition settings.
+ *
+ * It uses the open-source grammaticus and i18n packages to obtain the collators
+ * it needs.
+ *
+ * @author snakhoda
+ *
+ */
+@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
+    // input string
+    @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
+    // ISO Code for Locale
+    @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
+    // whether to use special upper case collator
+    @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true),
+    // collator strength
+    @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true),
+    // collator decomposition
+    @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) })
+public class CollationKeyFunction extends ScalarFunction {
+
+    private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class);
+
+    public static final String NAME = "COLLKEY";
--- End diff --

There's no convention as such. Oracle's functions are NLSSORT, NLS_UPPER, etc.

We can call it COLLATION_KEY here. I'd rather have the name be more 
descriptive than less. Does that work?




[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208215#comment-16208215
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on the issue:

https://github.com/apache/phoenix/pull/275
  
This is looking very good, @shehzaadn - thanks for the revisions. Couple 
more comments, but it's getting pretty close IMHO. How is the publishing to 
maven of the dependent jars looking?




[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208212#comment-16208212
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145233589
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---
@@ -0,0 +1,221 @@
+package org.apache.phoenix.expression.function;
+
+import java.sql.SQLException;
+import java.text.Collator;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Locale;
+
+import org.apache.commons.lang.BooleanUtils;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.phoenix.expression.Expression;
+import org.apache.phoenix.parse.FunctionParseNode;
+import org.apache.phoenix.schema.tuple.Tuple;
+import org.apache.phoenix.schema.types.PBoolean;
+import org.apache.phoenix.schema.types.PDataType;
+import org.apache.phoenix.schema.types.PInteger;
+import org.apache.phoenix.schema.types.PIntegerArray;
+import org.apache.phoenix.schema.types.PUnsignedIntArray;
+import org.apache.phoenix.schema.types.PVarbinary;
+import org.apache.phoenix.schema.types.PVarchar;
+import org.apache.phoenix.schema.types.PhoenixArray;
+import org.apache.phoenix.util.VarBinaryFormatter;
+
+import com.force.db.i18n.LinguisticSort;
+import com.force.i18n.LocaleUtils;
+
+import com.ibm.icu.impl.jdkadapter.CollatorICU;
+import com.ibm.icu.util.ULocale;
+
+/**
+ * A Phoenix Function that calculates a collation key for an input string based
+ * on a caller-provided locale and collator strength and decomposition settings.
+ *
+ * It uses the open-source grammaticus and i18n packages to obtain the collators
+ * it needs.
+ *
+ * @author snakhoda
+ *
+ */
+@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
+    // input string
+    @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
+    // ISO Code for Locale
+    @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
+    // whether to use special upper case collator
+    @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true),
+    // collator strength
+    @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true),
+    // collator decomposition
+    @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) })
+public class CollationKeyFunction extends ScalarFunction {
+
+    private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class);
+
+    public static final String NAME = "COLLKEY";
--- End diff --

Is there a convention in other RDBMS for the name of this function? Is it 
spelled out COLLATION_KEY or abbreviated as you've done? If abbreviated, then 
IMHO, it'd be better to name the class and unit tests CollKeyFunction, 
CollKeyFunctionIT, etc. to make it easier to find (i.e. based on the function 
name). That's our typical convention.




[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208207#comment-16208207
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145232581
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/jdbc/PhoenixConnection.java ---
@@ -336,6 +338,7 @@ public ReadOnlyProps getProps() {
 formatters.put(PUnsignedTimestamp.INSTANCE, timestampFormat);
 formatters.put(PDecimal.INSTANCE,
 FunctionArgumentType.NUMERIC.getFormatter(numberPattern));
+formatters.put(PVarbinary.INSTANCE, VarBinaryFormatter.INSTANCE);
--- End diff --

+1. Nice idea!




[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208205#comment-16208205
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145232171
  
--- Diff: 
phoenix-core/src/test/java/org/apache/phoenix/expression/function/CollationKeyFunctionTest.java
 ---
@@ -96,33 +96,35 @@ private static boolean testExpression(String inputStr, String localeIsoCode, Sor
         strengthLiteral = LiteralExpression.newConstant(null, PInteger.INSTANCE, sortOrder);
         decompositionLiteral = LiteralExpression.newConstant(null, PInteger.INSTANCE, sortOrder);
         boolean ret = testExpression(inputStrLiteral, localeIsoCodeLiteral, upperCaseBooleanLiteral, strengthLiteral,
-                decompositionLiteral, new PhoenixArray(PInteger.INSTANCE, expectedCollationKeyBytes));
+                decompositionLiteral, expectedCollationKeyBytesHex);
         return ret;
     }
 
     private static boolean testExpression(LiteralExpression inputStrLiteral, LiteralExpression localeIsoCodeLiteral,
             LiteralExpression upperCaseBooleanLiteral, LiteralExpression strengthLiteral,
-            LiteralExpression decompositionLiteral, PhoenixArray expectedCollationKeyByteArray) throws SQLException {
+            LiteralExpression decompositionLiteral, String expectedCollationKeyBytesHex) throws Exception {
         List<Expression> expressions = Lists.newArrayList((Expression) inputStrLiteral,
                 (Expression) localeIsoCodeLiteral, (Expression) upperCaseBooleanLiteral, (Expression) strengthLiteral,
                 (Expression) decompositionLiteral);
         Expression collationKeyFunction = new CollationKeyFunction(expressions);
         ImmutableBytesWritable ptr = new ImmutableBytesWritable();
         boolean ret = collationKeyFunction.evaluate(null, ptr);
         if (ret) {
-            PhoenixArray result = (PhoenixArray) collationKeyFunction.getDataType().toObject(ptr,
+            byte[] result = (byte[]) collationKeyFunction.getDataType().toObject(ptr,
                     collationKeyFunction.getSortOrder());
 
+            byte[] expectedCollationKeyByteArray = Hex.decodeHex(expectedCollationKeyBytesHex.toCharArray());
+
--- End diff --

Why not use assertArrayEquals here instead?




[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208200#comment-16208200
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145230960
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---
@@ -0,0 +1,221 @@
+package org.apache.phoenix.expression.function;
+
+import java.sql.SQLException;
+import java.text.Collator;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Locale;
+
+import org.apache.commons.lang.BooleanUtils;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.phoenix.expression.Expression;
+import org.apache.phoenix.parse.FunctionParseNode;
+import org.apache.phoenix.schema.tuple.Tuple;
+import org.apache.phoenix.schema.types.PBoolean;
+import org.apache.phoenix.schema.types.PDataType;
+import org.apache.phoenix.schema.types.PInteger;
+import org.apache.phoenix.schema.types.PIntegerArray;
+import org.apache.phoenix.schema.types.PUnsignedIntArray;
+import org.apache.phoenix.schema.types.PVarbinary;
+import org.apache.phoenix.schema.types.PVarchar;
+import org.apache.phoenix.schema.types.PhoenixArray;
+import org.apache.phoenix.util.VarBinaryFormatter;
+
+import com.force.db.i18n.LinguisticSort;
+import com.force.i18n.LocaleUtils;
+
+import com.ibm.icu.impl.jdkadapter.CollatorICU;
+import com.ibm.icu.util.ULocale;
+
+/**
+ * A Phoenix Function that calculates a collation key for an input string based
+ * on a caller-provided locale and collator strength and decomposition settings.
+ *
+ * It uses the open-source grammaticus and i18n packages to obtain the collators
+ * it needs.
+ *
+ * @author snakhoda
+ *
+ */
+@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
+    // input string
+    @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
+    // ISO Code for Locale
+    @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
+    // whether to use special upper case collator
+    @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true),
+    // collator strength
+    @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true),
+    // collator decomposition
+    @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) })
+public class CollationKeyFunction extends ScalarFunction {
+
+    private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class);
+
+    public static final String NAME = "COLLKEY";
+
+    public CollationKeyFunction() {
+    }
+
+    public CollationKeyFunction(List<Expression> children) throws SQLException {
+        super(children);
+    }
+
+    @Override
+    public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) {
+        try {
+            String inputValue = getInputValue(tuple, ptr);
+            String localeISOCode = getLocaleISOCode(tuple, ptr);
+            Boolean useSpecialUpperCaseCollator = getUseSpecialUpperCaseCollator(tuple, ptr);
+            Integer collatorStrength = getCollatorStrength(tuple, ptr);
+            Integer collatorDecomposition = getCollatorDecomposition(tuple, ptr);
+
+            Locale locale = LocaleUtils.get().getLocaleByIsoCode(localeISOCode);
+
+            if(LOG.isDebugEnabled()) {
+                LOG.debug(String.format("Locale: " + locale.toLanguageTag()));
+            }
+
+            LinguisticSort linguisticSort = LinguisticSort.get(locale);
+
+            Collator collator = BooleanUtils.isTrue(useSpecialUpperCaseCollator)
+                    ? linguisticSort.getUpperCaseCollator(false) : linguisticSort.getCollator();
+
+            if (collatorStrength != null) {
+                collator.setStrength(collatorStrength);
+            }
+
+            if (collatorDecomposition != null) {
+                collator.setDecomposition(collatorDecomposition);
+            }
+
+            if(LOG.isDebugEnabled()) {
+  

[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208197#comment-16208197
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r145230698
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---
@@ -0,0 +1,221 @@
+package org.apache.phoenix.expression.function;
+
+import java.sql.SQLException;
+import java.text.Collator;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Locale;
+
+import org.apache.commons.lang.BooleanUtils;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.phoenix.expression.Expression;
+import org.apache.phoenix.parse.FunctionParseNode;
+import org.apache.phoenix.schema.tuple.Tuple;
+import org.apache.phoenix.schema.types.PBoolean;
+import org.apache.phoenix.schema.types.PDataType;
+import org.apache.phoenix.schema.types.PInteger;
+import org.apache.phoenix.schema.types.PIntegerArray;
+import org.apache.phoenix.schema.types.PUnsignedIntArray;
+import org.apache.phoenix.schema.types.PVarbinary;
+import org.apache.phoenix.schema.types.PVarchar;
+import org.apache.phoenix.schema.types.PhoenixArray;
+import org.apache.phoenix.util.VarBinaryFormatter;
+
+import com.force.db.i18n.LinguisticSort;
+import com.force.i18n.LocaleUtils;
+
+import com.ibm.icu.impl.jdkadapter.CollatorICU;
+import com.ibm.icu.util.ULocale;
+
+/**
+ * A Phoenix Function that calculates a collation key for an input string based
+ * on a caller-provided locale and collator strength and decomposition settings.
+ *
+ * It uses the open-source grammaticus and i18n packages to obtain the collators
+ * it needs.
+ *
+ * @author snakhoda
+ *
+ */
+@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
+    // input string
+    @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
+    // ISO Code for Locale
+    @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
+    // whether to use special upper case collator
+    @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true),
+    // collator strength
+    @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true),
+    // collator decomposition
+    @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) })
+public class CollationKeyFunction extends ScalarFunction {
+
+    private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class);
+
+    public static final String NAME = "COLLKEY";
+
+    public CollationKeyFunction() {
+    }
+
+    public CollationKeyFunction(List<Expression> children) throws SQLException {
+        super(children);
+    }
+
+    @Override
+    public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) {
+        try {
+            String inputValue = getInputValue(tuple, ptr);
--- End diff --

The evaluate method is called for every row during processing, so we want 
to have as little code here as possible. You can create a Collator local 
variable and move all the code that sets it up to an init() method. You'd call 
the init() method in the CollationKeyFunction(List<Expression> children) 
constructor and in an overridden readFields method like this (see InstrFunction 
for an example):

    @Override
    public void readFields(DataInput input) throws IOException {
        super.readFields(input);
        init();
    }
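
To make the suggestion above concrete, here is a minimal sketch of the init() pattern. The class is a hypothetical stand-in: it does not extend ScalarFunction, it uses the JDK's java.text.Collator instead of LinguisticSort, and the serialized layout (a UTF string followed by an int, where a negative value means "no strength given") is invented for the example.

```java
import java.io.DataInput;
import java.io.IOException;
import java.text.Collator;
import java.util.Locale;

// Hypothetical stand-in for a Phoenix function; not the actual CollationKeyFunction.
final class CachedCollatorFunction {
    private String localeTag;   // stands in for the constant locale argument
    private Integer strength;   // null means "keep the collator's default strength"
    private Collator collator;  // derived state, rebuilt by init()

    // No-arg constructor, used before readFields() restores the state.
    CachedCollatorFunction() {
    }

    CachedCollatorFunction(String localeTag, Integer strength) {
        this.localeTag = localeTag;
        this.strength = strength;
        init();
    }

    // One-time setup: everything that does not depend on the current row.
    private void init() {
        Collator c = Collator.getInstance(Locale.forLanguageTag(localeTag));
        if (strength != null) {
            c.setStrength(strength);
        }
        this.collator = c;
    }

    // Restores the serialized arguments, then rebuilds the derived collator.
    void readFields(DataInput input) throws IOException {
        this.localeTag = input.readUTF();
        int s = input.readInt();
        this.strength = (s < 0) ? null : Integer.valueOf(s);
        init();
    }

    // Per-row path: only the key computation is left.
    byte[] evaluate(String value) {
        return collator.getCollationKey(value).toByteArray();
    }
}
```

Both construction paths end in init(), so an instance built from arguments and one rebuilt from its serialized form hold an equivalent collator before the first row is evaluated.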




[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208145#comment-16208145
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user snakhoda-sfdc commented on the issue:

https://github.com/apache/phoenix/pull/275
  
@JamesRTaylor Thanks for your comments. I added two further commits:

199c389: This addresses your comment about the byte array comparison. You 
were right! I must have been confused earlier because what sqlline.py displayed 
did not match the sort order. I also added a formatter for PVarbinary, because 
without it sqlline.py simply shows a Java hash code, which is hard to do 
anything with.

8cc2b5c: This adds the end-to-end tests you mentioned and also changes the 
unit test to use the hex representation of the byte array to make it easier to 
read.
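
The earlier review comment about byte array comparison is not reproduced in this thread, so the following is only a sketch of the general idea, assuming the point was that collation key bytes can be ordered by a plain unsigned byte comparison and that a hex rendering makes them readable. The class and method names are hypothetical, and java.text.Collator is used in place of the grammaticus/ICU4J collators.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper; not part of the Phoenix patch.
class CollationKeyBytesSketch {
    // Lexicographic comparison treating each byte as unsigned (0..255).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        // A shorter array that is a prefix of the longer one sorts first.
        return a.length - b.length;
    }

    // Hex rendering, so a byte[] prints as readable digits rather than
    // the default type-and-hash string.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        byte[] apple = collator.getCollationKey("apple").toByteArray();
        byte[] banana = collator.getCollationKey("Banana").toByteArray();
        System.out.println("apple  -> " + toHex(apple));
        System.out.println("Banana -> " + toHex(banana));
        // The sign of the byte comparison should agree with collator.compare.
        System.out.println(Integer.signum(compareUnsigned(apple, banana))
                == Integer.signum(collator.compare("apple", "Banana")));
    }
}
```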


> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
>
>
> Strings stored via Phoenix can be composed from a subset of the entire set of 
> Unicode characters. The natural sort order for strings for different 
> languages often differs from the order dictated by the binary representation 
> of the characters of these strings. Java provides the idea of a Collator 
> which given an input string and a (language) locale can generate a Collation 
> Key which can then be used to compare strings in that natural order.
> Salesforce has recently open-sourced grammaticus. IBM has open-sourced ICU4J 
> some time ago. These technologies can be combined to provide a robust new 
> Phoenix function that can be used in an ORDER BY clause to sort strings 
> according to the user's locale.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203947#comment-16203947
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r144620511
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---
@@ -0,0 +1,233 @@
+package org.apache.phoenix.expression.function;
+
+import java.sql.SQLException;
+import java.text.Collator;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Locale;
+
+import org.apache.commons.lang.BooleanUtils;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.phoenix.expression.Expression;
+import org.apache.phoenix.parse.FunctionParseNode;
+import org.apache.phoenix.schema.tuple.Tuple;
+import org.apache.phoenix.schema.types.PBoolean;
+import org.apache.phoenix.schema.types.PDataType;
+import org.apache.phoenix.schema.types.PInteger;
+import org.apache.phoenix.schema.types.PIntegerArray;
+import org.apache.phoenix.schema.types.PUnsignedIntArray;
+import org.apache.phoenix.schema.types.PVarbinary;
+import org.apache.phoenix.schema.types.PVarchar;
+import org.apache.phoenix.schema.types.PhoenixArray;
+
+import com.force.db.i18n.LinguisticSort;
+import com.force.i18n.LocaleUtils;
+
+import com.ibm.icu.impl.jdkadapter.CollatorICU;
+import com.ibm.icu.util.ULocale;
+
+/**
+ * A Phoenix Function that calculates a collation key for an input string based
+ * on a caller-provided locale and collator strength and decomposition settings.
+ *
+ * It uses the open-source grammaticus and i18n packages to obtain the collators
+ * it needs.
+ *
+ * @author snakhoda
+ *
+ */
+@FunctionParseNode.BuiltInFunction(name = CollationKeyFunction.NAME, args = {
+        // input string
+        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }),
+        // ISO Code for Locale
+        @FunctionParseNode.Argument(allowedTypes = { PVarchar.class }, isConstant = true),
+        // whether to use special upper case collator
+        @FunctionParseNode.Argument(allowedTypes = { PBoolean.class }, defaultValue = "false", isConstant = true),
+        // collator strength
+        @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true),
+        // collator decomposition
+        @FunctionParseNode.Argument(allowedTypes = { PInteger.class }, defaultValue = "null", isConstant = true) })
+public class CollationKeyFunction extends ScalarFunction {
+
+    private static final Log LOG = LogFactory.getLog(CollationKeyFunction.class);
+
+    public static final String NAME = "COLLKEY";
+
+    public CollationKeyFunction() {
+    }
+
+    public CollationKeyFunction(List<Expression> children) throws SQLException {
+        super(children);
+    }
+
+    @Override
+    public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) {
+        try {
+            String inputValue = getInputValue(tuple, ptr);
+            String localeISOCode = getLocaleISOCode(tuple, ptr);
+            Boolean useSpecialUpperCaseCollator = getUseSpecialUpperCaseCollator(tuple, ptr);
+            Integer collatorStrength = getCollatorStrength(tuple, ptr);
+            Integer collatorDecomposition = getCollatorDecomposition(tuple, ptr);
+
+            Locale locale = LocaleUtils.get().getLocaleByIsoCode(localeISOCode);
+
+            if(LOG.isDebugEnabled()) {
+                LOG.debug(String.format("Locale: " + locale.toLanguageTag()));
+            }
+
+            LinguisticSort linguisticSort = LinguisticSort.get(locale);
+
+            Collator collator = BooleanUtils.isTrue(useSpecialUpperCaseCollator)
+                    ? linguisticSort.getUpperCaseCollator(false) : linguisticSort.getCollator();
+
+            if (collatorStrength != null) {
+                collator.setStrength(collatorStrength);
+            }
+
+            if (collatorDecomposition != null) {
+                collator.setDecomposition(collatorDecomposition);
+            }
+
+            if(LOG.isDebugEnabled()) {
+                LOG.debug(String.format("Collator: [strength: 

[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203839#comment-16203839
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user shehzaadn-vd commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r144604412
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---

[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203799#comment-16203799
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r144600094
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---

[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203159#comment-16203159
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user snakhoda-sfdc commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r144483837
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---

[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202669#comment-16202669
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on the issue:

https://github.com/apache/phoenix/pull/275
  
Thanks for the patch, @shehzaadn. This looks like a general enough built-in 
function to include in Phoenix IMHO. See inline for more specific comments. 
It'd be much better to include the first two commits as external dependencies. 
If we don't do that, we'll need to quickly follow up with replacing them with 
external dependencies (and make sure we don't change those files at all).




[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202667#comment-16202667
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r144416251
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---
@@ -0,0 +1,233 @@
+package org.apache.phoenix.expression.function;
+
+import java.sql.SQLException;
+import java.text.Collator;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Locale;
+
+import org.apache.commons.lang.BooleanUtils;
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.phoenix.expression.Expression;
+import org.apache.phoenix.parse.FunctionParseNode;
+import org.apache.phoenix.schema.tuple.Tuple;
+import org.apache.phoenix.schema.types.PBoolean;
+import org.apache.phoenix.schema.types.PDataType;
+import org.apache.phoenix.schema.types.PInteger;
+import org.apache.phoenix.schema.types.PIntegerArray;
+import org.apache.phoenix.schema.types.PUnsignedIntArray;
+import org.apache.phoenix.schema.types.PVarbinary;
+import org.apache.phoenix.schema.types.PVarchar;
+import org.apache.phoenix.schema.types.PhoenixArray;
+
+import com.force.db.i18n.LinguisticSort;
+import com.force.i18n.LocaleUtils;
+
+import com.ibm.icu.impl.jdkadapter.CollatorICU;
+import com.ibm.icu.util.ULocale;
+
+/**
+ * A Phoenix Function that calculates a collation key for an input string based
+ * on a caller-provided locale and collator strength and decomposition settings.
+ *
+ * It uses the open-source grammaticus and i18n packages to obtain the collators
+ * it needs.
--- End diff --

We should include more comments here. In particular, what sort order will 
we get? Does this mimic some other database's behavior (e.g. Oracle)? Does it 
deviate from that at all? Does Oracle follow some standard that we could point 
to?

Also, please make sure to budget time to update our online reference 
manual: https://phoenix.apache.org/language/functions.html. This lives in 
phoenix.csv in our SVN repo as described here: 
https://phoenix.apache.org/building_website.html




[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202662#comment-16202662
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r144415724
  
--- Diff: 
phoenix-core/src/test/java/org/apache/phoenix/expression/function/CollationKeyFunctionTest.java
 ---
@@ -0,0 +1,134 @@
+package org.apache.phoenix.expression.function;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+import java.sql.SQLException;
+import java.util.List;
+
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.phoenix.expression.function.CollationKeyFunction;
+import org.apache.phoenix.schema.SortOrder;
+import org.apache.phoenix.schema.types.PBoolean;
+import org.apache.phoenix.schema.types.PInteger;
+import org.apache.phoenix.schema.types.PVarchar;
+import org.apache.phoenix.schema.types.PhoenixArray;
+
+import org.apache.phoenix.expression.Expression;
+import org.apache.phoenix.expression.LiteralExpression;
+
+import org.junit.Test;
+
+import com.google.common.collect.Lists;
+
+/**
+ * "Unit" tests for CollationKeyFunction
+ * 
+ * @author snakhoda
+ *
+ */
+public class CollationKeyFunctionTest {
--- End diff --

We'll need more tests. You really want to test that the sort order of a list 
of strings matches the expected linguistic sort order. The current tests don't 
do much to validate that the sort order is correct IMHO.

We'll also want end2end tests that use the new function.
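
A sort-order check along these lines might look like the sketch below. It is only an illustration under stated assumptions: it uses the JDK's java.text.Collator for English rather than the grammaticus/ICU4J collators the patch relies on, and LinguisticSortOrderSketch and its methods are hypothetical names, not part of the Phoenix test suite.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a sort-order test; not the Phoenix test class.
class LinguisticSortOrderSketch {
    // Unsigned lexicographic comparison of two byte arrays.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Sorts a copy of the input by the byte order of each string's collation key.
    static List<String> sortByCollationKey(List<String> input, Collator collator) {
        List<String> sorted = new ArrayList<>(input);
        sorted.sort((x, y) -> compareUnsigned(
                collator.getCollationKey(x).toByteArray(),
                collator.getCollationKey(y).toByteArray()));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("banana", "Zebra", "apple");
        List<String> expected = Arrays.asList("apple", "banana", "Zebra");

        // Plain String ordering puts upper-case "Zebra" ahead of the
        // lower-case words, which is the mismatch the function addresses.
        List<String> binary = new ArrayList<>(input);
        binary.sort(String::compareTo);
        System.out.println("binary order:     " + binary);

        List<String> linguistic =
                sortByCollationKey(input, Collator.getInstance(Locale.ENGLISH));
        System.out.println("linguistic order: " + linguistic);

        if (!linguistic.equals(expected)) {
            throw new AssertionError("unexpected order: " + linguistic);
        }
    }
}
```

A fuller test would repeat this with word lists and expected orders for several non-English locales, since those are the cases the function exists to handle.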




[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202655#comment-16202655
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user joshelser commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r144414821
  
--- Diff: phoenix-core/src/main/java/com/force/db/i18n/OracleUpper.java ---
@@ -0,0 +1,66 @@
+/* 
--- End diff --

Yup! You got it right, James. Whether we include the code in binary form or 
source form, for BSD, we treat them the same (propagate in LICENSE, and 
copyright/etc in NOTICE). If there's a license header for the file, we would 
also leave that, IIRC.




[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202651#comment-16202651
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r144414623
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/CollationKeyFunction.java
 ---

[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202645#comment-16202645
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/275#discussion_r144413717
  
--- Diff: phoenix-core/src/main/java/com/force/db/i18n/OracleUpper.java ---
@@ -0,0 +1,66 @@
+/* 
--- End diff --

@joshelser - my take, based on this[1], is that it's OK to include 
BSD-licensed source code in an ASF project (as opposed to only depending on 
BSD-licensed software as an external dependency). WDYT?

[1] 
http://apache.org/licenses/#code-developed-elsewhere-received-under-a-category-a-license-incorporated-into-apache-projects-distributed-by-apache-and-licensed-to-downstream-users-under-its-original-license


> Allow sorting on (Java) collation keys for non-English locales
> --
>
> Key: PHOENIX-4237
> URL: https://issues.apache.org/jira/browse/PHOENIX-4237
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Shehzaad Nakhoda
> Fix For: 4.12.0
>
>
> Strings stored via Phoenix can be composed from a subset of the entire set of 
> Unicode characters. The natural sort order of strings in different 
> languages often differs from the order dictated by the binary representation 
> of their characters. Java provides the Collator class, which, given a 
> (language) locale, can generate a collation key for an input string; these keys 
> can then be used to compare strings in that natural order.
> Salesforce has recently open-sourced grammaticus. IBM open-sourced ICU4J 
> some time ago. These technologies can be combined to provide a robust new 
> Phoenix function that can be used in an ORDER BY clause to sort strings 
> according to the user's locale.
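
For readers unfamiliar with collation keys, here is a minimal sketch of the idea in the description, using only the JDK's java.text.Collator rather than the grammaticus/ICU4J collators the patch relies on. It serializes a collation key to bytes and compares the bytes as unsigned values, which is the kind of comparison a byte-ordered store performs. The class and method names (CollationKeyBytesDemo, keyBytes, compareByKeyBytes) are made up for illustration, and Arrays.compareUnsigned needs Java 9 or later.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollationKeyBytesDemo {

    // Serializes the collation key of s for the given locale. These bytes are
    // what a store that only understands byte order would have to compare.
    static byte[] keyBytes(String s, Locale locale) {
        return Collator.getInstance(locale).getCollationKey(s).toByteArray();
    }

    // Unsigned lexicographic comparison of the serialized keys (Java 9+).
    static int compareByKeyBytes(String a, String b, Locale locale) {
        return Arrays.compareUnsigned(keyBytes(a, locale), keyBytes(b, locale));
    }

    public static void main(String[] args) {
        Locale locale = Locale.ENGLISH;
        // Raw comparison: 'Z' (0x5A) sorts before 'a' (0x61).
        System.out.println("apple".compareTo("Zebra") > 0);
        // Key bytes follow the collator instead: "apple" sorts before "Zebra".
        System.out.println(compareByKeyBytes("apple", "Zebra", locale) < 0);
    }
}
```

The point of the sketch is that the ordering survives serialization: comparing the key bytes gives the same result as asking the collator directly, so the keys can be sorted without a collator at hand.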



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-4237) Allow sorting on (Java) collation keys for non-English locales

2017-10-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192500#comment-16192500
 ] 

ASF GitHub Bot commented on PHOENIX-4237:
-

GitHub user shehzaadn opened a pull request:

https://github.com/apache/phoenix/pull/275

PHOENIX-4237: Add function to calculate Java collation keys

Here we implement a generalized solution for calculating Java collation 
keys by creating Java collators based on a user locale. These collation keys 
can then be used in an ORDER BY clause to sort strings in a 
natural-language-appropriate way. We add a new Phoenix function, COLLKEY. In 
general, usage of this function will be:

select name from my_table order by COLLKEY(name, 'zh_TW')
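
As a rough plain-JDK analogue of the query above (a sketch only, not the Phoenix implementation), the following compares sorting by raw string order with sorting by collation key. It uses java.text.Collator with Locale.FRENCH instead of 'zh_TW' so the expected order is easy to verify by eye; CollationSortDemo and its methods are hypothetical names, and String.compareTo stands in for binary order.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class CollationSortDemo {

    // Sorts by UTF-16 code unit order (String.compareTo), standing in for the
    // binary order a plain ORDER BY on the column would use.
    static List<String> sortBinary(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    // Sorts by the collation key each string gets from the locale's collator,
    // loosely mirroring ORDER BY COLLKEY(name, locale).
    static List<String> sortByCollationKey(List<String> names, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparing((String s) -> collator.getCollationKey(s)));
        return sorted;
    }

    public static void main(String[] args) {
        // "\u00e9toile" is "etoile" with an acute accent on the first letter.
        List<String> names = Arrays.asList("Zebra", "apple", "\u00e9toile");
        System.out.println(sortBinary(names));
        System.out.println(sortByCollationKey(names, Locale.FRENCH));
    }
}
```

Binary order puts "Zebra" first because uppercase letters have lower code points than lowercase ones, and puts the accented word last. Sorting by collation key yields apple, étoile, Zebra instead.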

We use artifacts from the ICU4J project and the recently open-sourced 
grammaticus project (as Maven dependencies). We were forced to include some code 
from ICU4J because some jars produced by that project aren't published to 
Maven. We also include code from Salesforce that has been licensed for 
open-source release but not yet published as Maven artifacts.

There are three commits that split the changes into three logical pieces:

1) f8cb121: Add the external source code described above
2) fdbb5e0: Make the changes to the Phoenix license required by the above 
(and fix what seems to be an existing bug) 
3) 98cfc10: The actual function implementation of COLLKEY - new code that 
uses the code introduced above and newly introduced dependencies via maven.

Thanks in advance to the Phoenix community for your feedback on this.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/shehzaadn/phoenix master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/phoenix/pull/275.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #275


commit f8cb121145163591345eea70acbc313098e23e21
Author: Shehzaad 
Date:   2017-09-30T01:52:46Z

(1) add ICU4J source code for charset/localespi jars and (2) add Salesforce 
i18n-util source code

commit fdbb5e009a767e0f6df385dc9a1a8472b32cc361
Author: Shehzaad 
Date:   2017-10-02T17:55:39Z

(1) Fix text of 3-clause BSD License, (2) add Unicode license, (3) add 
mention of bundling ICU4J and i18n-util code

commit 98cfc10bac3c48ec3e7ceb47bea0b60556265c85
Author: Shehzaad 
Date:   2017-10-02T21:58:31Z

add function COLLKEY to Phoenix to calculate a Java collation key on a 
given string with the collator derived from an ISO locale code and some other 
parameters
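
To illustrate what the "other parameters" do, here is a hedged sketch of a COLLKEY-like helper. Going by the argument annotations in the patch, the parameters include an optional collator strength and decomposition; the upper-case collator flag is left out here. The helper collKey is hypothetical and built on java.text.Collator, not on the grammaticus LinguisticSort the patch uses. Mapping a code such as 'zh_TW' to a locale by swapping '_' for '-' is an assumption made for this example, not the behaviour of LocaleUtils.getLocaleByIsoCode.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollKeySketch {

    // Hypothetical stand-in for COLLKEY(input, localeCode, strength, decomposition).
    // A null strength or decomposition keeps the collator's default, matching
    // the null checks in the patch's evaluate() method.
    static byte[] collKey(String input, String localeCode, Integer strength, Integer decomposition) {
        // Assumption: codes such as "zh_TW" map onto BCP 47 tags ("zh-TW").
        Locale locale = Locale.forLanguageTag(localeCode.replace('_', '-'));
        Collator collator = Collator.getInstance(locale);
        if (strength != null) {
            collator.setStrength(strength);
        }
        if (decomposition != null) {
            collator.setDecomposition(decomposition);
        }
        return collator.getCollationKey(input).toByteArray();
    }

    public static void main(String[] args) {
        String plain = "resume";
        String accented = "r\u00e9sum\u00e9";
        // PRIMARY strength ignores accents, so both strings get the same key.
        System.out.println(Arrays.equals(
                collKey(plain, "en_US", Collator.PRIMARY, null),
                collKey(accented, "en_US", Collator.PRIMARY, null)));
        // TERTIARY strength distinguishes accents, so the keys differ.
        System.out.println(Arrays.equals(
                collKey(plain, "en_US", Collator.TERTIARY, null),
                collKey(accented, "en_US", Collator.TERTIARY, null)));
    }
}
```

Lower strengths make more strings compare as equal, which matters when the key is used for grouping or de-duplication as well as ordering. Canonical decomposition makes a precomposed character and its base-letter-plus-combining-mark form produce the same key.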



