[ 
https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008079#comment-16008079
 ] 

ASF GitHub Bot commented on JENA-1313:
--------------------------------------

Github user afs commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116218340
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/NodeValue.java ---
    @@ -783,6 +772,22 @@ private static int compare(NodeValue nv1, NodeValue nv2, boolean sortOrderingCom
                         return Expr.CMP_GREATER ;
                     return Expr.CMP_EQUAL;  // Both plain or both xsd:string.
                 }
    +            case VSPACE_SORTKEY :
    +            {
    +                int cmp = 0;
    +                String c1 = nv1.getCollation();
    +                String c2 = nv2.getCollation();
    +                if (c1 != null && c2 != null && c1.equals(c2)) {
    +                    // locales are parsed. Here we could think about caching if necessary
    +                    Locale desiredLocale = Locale.forLanguageTag(c1);
    +                    // collators are already stored in a concurrent map by the JVM, with <locale, softref<collator>>
    +                    Collator collator = Collator.getInstance(desiredLocale);
    +                    cmp = collator.compare(nv1.getString(), nv2.getString());
    +                } else {
    --- End diff --
    
    Would it be better to make `NodeSortKey` comparable rather than 
putting the comparison here?
    
    That then removes the need for `getCollation`.
    
    And, as a general mechanism, `NodeSortKey` values are not restricted to 
language-based collation.  Maybe an extension is a `NodeSortKey` based on an 
enum-like interpretation of a value (e.g. a person's title or rank, or even 
"one", "two", "three").
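
A minimal sketch of what a comparable sort key could look like, to illustrate the suggestion above. The class name `ComparableSortKey` and its fields are hypothetical and are not the actual Jena `NodeSortKey`. The fallback used when the collations are missing or differ is also an assumption, since the `else` branch of the diff is not shown.

```java
import java.text.Collator;
import java.util.Locale;
import java.util.Objects;

/** Hypothetical sort key that owns its comparison logic; not the Jena class. */
final class ComparableSortKey implements Comparable<ComparableSortKey> {
    private final String string;
    private final String collation; // BCP 47 language tag, or null if none

    ComparableSortKey(String string, String collation) {
        this.string = Objects.requireNonNull(string);
        this.collation = collation;
    }

    String getString() {
        return string;
    }

    @Override
    public int compareTo(ComparableSortKey other) {
        if (collation != null && collation.equals(other.collation)) {
            // Same collation on both sides: compare with that locale's rules.
            Collator collator = Collator.getInstance(Locale.forLanguageTag(collation));
            return collator.compare(string, other.string);
        }
        // Assumed fallback: plain String ordering when the collations
        // are missing or do not match.
        return string.compareTo(other.string);
    }
}
```

With the comparison inside the key, the `VSPACE_SORTKEY` case would only need to call `compareTo`, and the collation would not have to be exposed through a getter. A real implementation would still need to make sure that keys with mixed collations sort in a consistent total order.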


> Language-specific collation in ARQ
> ----------------------------------
>
>                 Key: JENA-1313
>                 URL: https://issues.apache.org/jira/browse/JENA-1313
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.2.0
>            Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users 
> mailing list in October 2016, I would like to change ARQ collation of literal 
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the 
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
>  method.
> It currently sorts by lexical value first, then by language tag. Since the 
> collation order needs to be stable across all possible literal values, I 
> think the safest way would be to sort by language tag first, then by lexical 
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different 
> collation rules than the main language? It would be a bit strange if all 
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same 
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in 
> implementing it.
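
The ordering proposed in the description (language tag first, then lexical value under that language's collation rules) could be sketched roughly as follows. `LangLiteral` and `LangFirstComparator` are hypothetical names used only for illustration and are not part of ARQ. Comparing the tags case-insensitively and adding a final tie-break on the raw lexical form are assumptions of this sketch.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

/** Hypothetical stand-in for a language-tagged literal. */
final class LangLiteral {
    final String lexical;
    final String lang; // language tag, "" when there is none

    LangLiteral(String lexical, String lang) {
        this.lexical = lexical;
        this.lang = lang;
    }
}

/** Orders by language tag first, then by lexical value using that language's collator. */
final class LangFirstComparator implements Comparator<LangLiteral> {
    @Override
    public int compare(LangLiteral a, LangLiteral b) {
        // Language tags are case-insensitive, so "en-US" and "en-us" group together.
        int byLang = a.lang.compareToIgnoreCase(b.lang);
        if (byLang != 0) {
            return byLang;
        }
        // Same language: apply its collation rules ("" maps to the root locale).
        Collator collator = Collator.getInstance(Locale.forLanguageTag(a.lang));
        int byText = collator.compare(a.lexical, b.lexical);
        if (byText != 0) {
            return byText;
        }
        // Assumed tie-break so strings the collator treats as equal still get a fixed order.
        return a.lexical.compareTo(b.lexical);
    }
}
```

Note that under this scheme every `en-US` literal sorts after every `en` literal, which is exactly the oddity raised in the description; the sketch does not attempt to resolve it.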



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
