[
https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967901#comment-15967901
]
ASF GitHub Bot commented on JENA-1313:
--------------------------------------
Github user osma commented on the issue:
https://github.com/apache/jena/pull/237
I've been thinking about this, and I can't see how it could produce a
usable order when several language tags (or even subtags) are involved. For
example, in a multilingual SKOS thesaurus, it's quite likely that `en-US` and
`en-GB` labels are mixed together (I know of at least a couple of thesauri
that do this in their SKOS files). Consider, for example:
```
ex:zea_mays a skos:Concept ;
    skos:prefLabel "corn"@en-US, "maize"@en-GB .

ex:coffee a skos:Concept ;
    skos:prefLabel "coffee"@en-US, "coffee"@en-GB .
```
Now an ORDER BY that sorts primarily by language tag, then by
language-specific collation rules, would order these skos:prefLabels as:
1. `"coffee"@en-GB`
1. `"maize"@en-GB`
1. `"coffee"@en-US`
1. `"corn"@en-US`
I have a hard time seeing how this order would be useful to anyone. These
are all English-language words; as a user I don't care whether they are sorted
by GB or US collation rules (even if the rules differed, as in fr-CA vs.
fr-FR). Grouping them by language tag is clearly worse than the current
behavior, which sorts first by lexical value, then by language tag.
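
To make the comparison concrete, here is a minimal sketch in plain Java, not ARQ code. `LangLiteral` is a hypothetical stand-in for a language-tagged literal, and plain string comparison stands in for both the lexical comparison and the language-specific collation step, which is an assumption that happens to be harmless for these lowercase ASCII labels.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class TagFirstOrdering {
    /** Hypothetical stand-in for an RDF literal with a language tag. */
    static final class LangLiteral {
        final String lexical;
        final String lang;

        LangLiteral(String lexical, String lang) {
            this.lexical = lexical;
            this.lang = lang;
        }

        String getLexical() { return lexical; }
        String getLang() { return lang; }

        @Override
        public String toString() { return "\"" + lexical + "\"@" + lang; }
    }

    // Proposed order: language tag first, then lexical value.
    static final Comparator<LangLiteral> TAG_FIRST =
        Comparator.comparing(LangLiteral::getLang).thenComparing(LangLiteral::getLexical);

    // Current order as described in the issue: lexical value first, then language tag.
    static final Comparator<LangLiteral> LEXICAL_FIRST =
        Comparator.comparing(LangLiteral::getLexical).thenComparing(LangLiteral::getLang);

    // The four skos:prefLabel values from the example data.
    static List<LangLiteral> labels() {
        return Arrays.asList(
            new LangLiteral("corn", "en-US"),
            new LangLiteral("maize", "en-GB"),
            new LangLiteral("coffee", "en-US"),
            new LangLiteral("coffee", "en-GB"));
    }

    static List<String> sorted(Comparator<LangLiteral> order) {
        return labels().stream()
            .sorted(order)
            .map(LangLiteral::toString)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints ["coffee"@en-GB, "maize"@en-GB, "coffee"@en-US, "corn"@en-US]
        System.out.println(sorted(TAG_FIRST));
        // prints ["coffee"@en-GB, "coffee"@en-US, "corn"@en-US, "maize"@en-GB]
        System.out.println(sorted(LEXICAL_FIRST));
    }
}
```

With the tag-first comparator the two "coffee" labels end up separated by "maize"; with the lexical-first comparator they stay adjacent.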
My conclusion from this thought experiment is that there should be a way to
specify the collation order in the ORDER BY statement independently of the
language tags of the literals being sorted.
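
As a rough illustration of that idea, the sketch below sorts with a single caller-chosen collation and uses the language tag only as a tie-breaker. It is plain Java built on `java.text.Collator`, not a proposal for the actual ARQ syntax or API; `LangLiteral` and `byCollation` are hypothetical names, and how the locale would be passed in from a query is left open.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

class FixedCollationOrdering {
    /** Hypothetical stand-in for an RDF literal with a language tag. */
    static final class LangLiteral {
        final String lexical;
        final String lang;

        LangLiteral(String lexical, String lang) {
            this.lexical = lexical;
            this.lang = lang;
        }

        @Override
        public String toString() { return "\"" + lexical + "\"@" + lang; }
    }

    // Compare lexical values with one collation chosen by the caller,
    // ignoring each literal's own tag; the tag only breaks ties so that
    // the resulting order stays stable.
    static Comparator<LangLiteral> byCollation(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        return (a, b) -> {
            int c = collator.compare(a.lexical, b.lexical);
            return c != 0 ? c : a.lang.compareTo(b.lang);
        };
    }

    static List<String> sorted(Locale locale) {
        List<LangLiteral> labels = Arrays.asList(
            new LangLiteral("corn", "en-US"),
            new LangLiteral("maize", "en-GB"),
            new LangLiteral("coffee", "en-US"),
            new LangLiteral("coffee", "en-GB"));
        return labels.stream()
            .sorted(byCollation(locale))
            .map(LangLiteral::toString)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints ["coffee"@en-GB, "coffee"@en-US, "corn"@en-US, "maize"@en-GB]
        System.out.println(sorted(Locale.ENGLISH));
    }
}
```

Because the collation is fixed for the whole sort, the result does not depend on which subtags happen to appear in the data.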
> Language-specific collation in ARQ
> ----------------------------------
>
> Key: JENA-1313
> URL: https://issues.apache.org/jira/browse/JENA-1313
> Project: Apache Jena
> Issue Type: Improvement
> Components: ARQ
> Affects Versions: Jena 3.2.0
> Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users
> mailing list in October 2016, I would like to change ARQ collation of literal
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
> method.
> It currently sorts by lexical value first, then by language tag. Since the
> collation order needs to be stable across all possible literal values, I
> think the safest way would be to sort by language tag first, then by lexical
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different
> collation rules than the main language? It would be a bit strange if all
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in
> implementing it.