[jira] [Commented] (JENA-1313) Language-specific collation in ARQ

ASF GitHub Bot (JIRA) Thu, 13 Apr 2017 07:27:57 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967644#comment-15967644
 ]


ASF GitHub Bot commented on JENA-1313:
--------------------------------------

Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    >I agree with the other commenters, the general order should be (lang, lex) 
to avoid potentially inconsistent ordering.
    
    Ack, that makes sense +1
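    
    As a rough sketch of the (lang, lex) ordering, assuming a hypothetical 
`Lit` holder for a language-tagged literal (this is not a Jena type, and 
`LangLexOrder` is only an illustrative name), the comparator could look like 
this. It uses plain string comparison for the lexical form; locale-aware 
collation would replace the second key.

```java
import java.util.Comparator;
import java.util.Locale;

final class LangLexOrder {
    private LangLexOrder() {}

    /** Hypothetical stand-in for a language-tagged literal; not a Jena class. */
    static final class Lit {
        final String lex;
        final String lang;

        Lit(String lex, String lang) {
            this.lex = lex;
            this.lang = lang;
        }
    }

    /** Orders by language tag first (case-insensitively), then by lexical form. */
    static final Comparator<Lit> BY_LANG_THEN_LEX =
        Comparator.comparing((Lit l) -> l.lang.toLowerCase(Locale.ROOT))
                  .thenComparing(l -> l.lex);
}
```

    With this ordering, all @en literals sort together before all @fi 
literals, regardless of their lexical values.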
    
    >Also the language tag may not match any Locale. We also need to have unit 
tests that verify that the code works in corner cases like this.
    
    Sure, tests and more defensive programming will come later. Right now I'm 
mainly looking for comments on how to sort, where to sort, etc.
    
    Besides typos/misspellings, there are also valid tags such as i-klingon (I 
believe this is mentioned in some specification linked from the SPARQL spec 
page). For cases like this I think we would simply try to match the tag against 
the JVM's available locales and, if there is no match, fall back to normal 
string comparison.
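    
    A minimal sketch of that fallback, using only standard JDK calls 
(`Locale.forLanguageTag`, `Collator.getAvailableLocales`, 
`Collator.getInstance`). The class and method names are hypothetical and not 
part of ARQ; which locales count as available depends on the JVM in use.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;

final class LangCollation {
    private LangCollation() {}

    /**
     * Returns a locale-aware comparator when the JVM lists a collator for the
     * given language tag, and plain String ordering otherwise.
     */
    static Comparator<String> comparatorFor(String langTag) {
        if (langTag == null || langTag.isEmpty()) {
            return Comparator.naturalOrder();
        }
        // Ill-formed tags do not throw here; they yield an empty/partial Locale.
        Locale locale = Locale.forLanguageTag(langTag);
        boolean available =
            Arrays.asList(Collator.getAvailableLocales()).contains(locale);
        if (!available) {
            // No matching locale: fall back to normal string comparison.
            return Comparator.naturalOrder();
        }
        Collator collator = Collator.getInstance(locale);
        return (a, b) -> collator.compare(a, b);
    }
}
```

    A caller could then write `LangCollation.comparatorFor("i-klingon")` and 
get a usable comparator whether or not the JVM knows the tag.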
    
    >But what about subtags like en-US and en-GB? If the language tag is the 
primary sort key, then all en-GB values would sort before "a"@en-US, which I 
think would be confusing for most users.
    
    The sort order and collation locale could be based on just the main tag (en 
in this case) ignoring the subtags, but I'm quite sure there is some language 
subtag out there in the world that requires a different collation order from 
that of the main language...
    
    The sort order of accented letters is different for fr-CA and fr-FR.
    
    fr-FR:
    
    * cote
    * coté
    * côte
    * côté
    
    fr-CA:
    
    * cote
    * côte
    * coté
    * côté
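    
    For comparison, the simpler option of keying on the main tag only could 
be sketched as below. `PrimaryLang` and `primaryLanguage` are hypothetical 
names, not ARQ code. Note that this would give both regional variants listed 
above the same collation, which is the limitation the lists illustrate.

```java
import java.util.Locale;

final class PrimaryLang {
    private PrimaryLang() {}

    /**
     * Returns the primary language subtag in lower case, e.g. "en" for "en-US".
     * Returns the empty string when the tag is ill-formed.
     */
    static String primaryLanguage(String langTag) {
        return Locale.forLanguageTag(langTag).getLanguage();
    }
}
```

    Under this scheme "a"@en-US and "a"@en-GB would share the sort key "en", 
with the full tag available as a tie-breaker if a stable order is needed.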



> Language-specific collation in ARQ
> ----------------------------------
>
>                 Key: JENA-1313
>                 URL: https://issues.apache.org/jira/browse/JENA-1313
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.2.0
>            Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users 
> mailing list in October 2016, I would like to change ARQ collation of literal 
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the 
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
>  method.
> It currently sorts by lexical value first, then by language tag. Since the 
> collation order needs to be stable across all possible literal values, I 
> think the safest way would be to sort by language tag first, then by lexical 
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different 
> collation rules than the main language? It would be a bit strange if all 
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same 
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in 
> implementing it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
