[jira] [Commented] (JENA-1313) Language-specific collation in ARQ

ASF GitHub Bot (JIRA) Fri, 12 May 2017 16:48:20 -0700

    [ https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008925#comment-16008925 ]


ASF GitHub Bot commented on JENA-1313:
--------------------------------------

Github user kinow commented on a diff in the pull request:

    https://github.com/apache/jena/pull/237#discussion_r116342551
  
    --- Diff: jena-arq/src/main/java/org/apache/jena/sparql/expr/NodeValue.java ---
    @@ -783,6 +772,22 @@ private static int compare(NodeValue nv1, NodeValue nv2, boolean sortOrderingCom
                         return Expr.CMP_GREATER ;
                     return Expr.CMP_EQUAL;  // Both plain or both xsd:string.
                 }
    +            case VSPACE_SORTKEY :
    +            {
    +                int cmp = 0;
    +                String c1 = nv1.getCollation();
    +                String c2 = nv2.getCollation();
    +                if (c1 != null && c2 != null && c1.equals(c2)) {
    +                    // locales are parsed. Here we could think about caching if necessary
    +                    Locale desiredLocale = Locale.forLanguageTag(c1);
    +                    // collators are already stored in a concurrent map by the JVM, with <locale, softref<collator>>
    +                    Collator collator = Collator.getInstance(desiredLocale);
    +                    cmp = collator.compare(nv1.getString(), nv2.getString());
    +                } else {
    --- End diff --
    
    Making it comparable was a great idea! Done. It reduced the changes in
    NodeValue and made it easier to write tests for the comparison (the core
    part of this change). Thanks Andy!

    As for the general mechanism, I'd be +1, but I think that can come later.
    The class is final for now and has a few comments. If I understand
    correctly, someone could later remove the final modifier, take the
    collation out of the class, and move it into a subtype or enum. That way
    `NodeSortKey` could be reused for comparing other things. One could even
    write a simple function that compares values by some string edit distance.
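
To make the "comparable" idea concrete, here is a minimal sketch of a sort key that carries its own collation and implements `Comparable`. The class name `SortKeySketch` is hypothetical and this is not the `NodeSortKey` code from the pull request. The `else` branch of the diff above is cut off, so the fallback to plain `String.compareTo` is an assumption made only for illustration.

```java
import java.text.Collator;
import java.util.Locale;
import java.util.Objects;

/** Hypothetical stand-in for a comparable sort key; not the class from the pull request. */
final class SortKeySketch implements Comparable<SortKeySketch> {
    private final String value;
    private final String collation; // BCP 47 language tag such as "en" or "fi"; may be null

    SortKeySketch(String value, String collation) {
        this.value = Objects.requireNonNull(value, "value");
        this.collation = collation;
    }

    @Override
    public int compareTo(SortKeySketch other) {
        // Same condition as in the diff: both collations present and equal.
        if (collation != null && collation.equals(other.collation)) {
            Locale locale = Locale.forLanguageTag(collation);
            Collator collator = Collator.getInstance(locale);
            return collator.compare(value, other.value);
        }
        // Assumption: fall back to plain code-unit order when the collations
        // are missing or differ. The real fallback is not visible in the diff.
        return value.compareTo(other.value);
    }
}
```

With the `en` tag, "a" sorts before "B", whereas plain `String.compareTo` puts "B" first. Keeping the comparison inside the key is what lets the tests exercise it without going through NodeValue.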


> Language-specific collation in ARQ
> ----------------------------------
>
>                 Key: JENA-1313
>                 URL: https://issues.apache.org/jira/browse/JENA-1313
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.2.0
>            Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users 
> mailing list in October 2016, I would like to change ARQ collation of literal 
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the 
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
>  method.
> It currently sorts by lexical value first, then by language tag. Since the 
> collation order needs to be stable across all possible literal values, I 
> think the safest way would be to sort by language tag first, then by lexical 
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different 
> collation rules than the main language? It would be a bit strange if all 
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same 
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in 
> implementing it.
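
The ordering proposed in the description (language tag first, then lexical value under that language's collation rules) could be sketched roughly as follows. The `Lit` type and the comparator name are hypothetical, and this is neither the existing `NodeUtils.compareLiteralsBySyntax` code nor the change in the pull request. The case-insensitive tag comparison and the final tie-break on the raw string are assumptions added to keep the order total.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

final class LangLiteralOrderSketch {
    /** Hypothetical minimal literal: lexical form plus language tag ("" when absent). */
    static final class Lit {
        final String lex;
        final String lang;

        Lit(String lex, String lang) {
            this.lex = lex;
            this.lang = lang;
        }
    }

    /** Orders by language tag first, then by the collation rules for that tag. */
    static final Comparator<Lit> BY_LANG_THEN_COLLATION = (a, b) -> {
        // Language tags are case-insensitive, so "en" and "EN" compare as equal.
        int byTag = a.lang.compareToIgnoreCase(b.lang);
        if (byTag != 0) {
            return byTag;
        }
        // An empty tag yields Locale.ROOT, i.e. the default collation rules.
        Collator collator = Collator.getInstance(Locale.forLanguageTag(a.lang));
        int byText = collator.compare(a.lex, b.lex);
        // Assumption: break collator ties on the raw string so that distinct
        // lexical forms never compare as equal.
        return byText != 0 ? byText : a.lex.compareTo(b.lex);
    };
}
```

Under this sketch every `@en-US` literal does sort after every `@en` literal, which is exactly the behaviour the description calls a bit strange. Grouping subtags with their main language would need a different first-level key.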



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
