[
https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953256#comment-15953256
]
Andy Seaborne commented on JENA-1313:
-------------------------------------
Is this summary correct and complete?
The current collation is:
# Order by lexical form (unicode codepoints)
# then by language tag (case insensitive)
# then language tag (case sensitive).
and it is proposed that:
# Order by language tag, then order by lexical form.
# Same ordering for language tags: case insensitive then case sensitive.
# Same locale: order within the same language tag (compared case insensitively) is by
locale collation.
# Unknown locale: (e.g. unknown language tag), use codepoint ordering.
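To make the two orderings above concrete, here is a minimal Java sketch of both as comparators. It is an illustration only, not Jena code: {{LangLiteral}}, {{CURRENT}}, {{PROPOSED}} and {{collatorOrNull}} are hypothetical names, and two points are assumptions rather than part of the proposal. First, a tag counts as "unknown" when its primary language is not in {{Locale.getISOLanguages()}}. Second, the case-sensitive tag comparison is applied last, as a final tie-break. Note also that {{String.compareTo}} compares UTF-16 code units, which agrees with codepoint order except for supplementary characters.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;

public class LangLiteralOrdering {

    /** Hypothetical stand-in for a language-tagged literal; not a Jena class. */
    static final class LangLiteral {
        final String lex;   // lexical form
        final String lang;  // language tag, e.g. "en" or "pt-BR"
        LangLiteral(String lex, String lang) { this.lex = lex; this.lang = lang; }
    }

    // Language tags: case insensitive first, case sensitive as the tie-break.
    static final Comparator<String> LANG_TAG =
        String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder());

    // Current ordering as summarised: lexical form first, then language tag.
    static final Comparator<LangLiteral> CURRENT = (a, b) -> {
        int c = a.lex.compareTo(b.lex);
        return c != 0 ? c : LANG_TAG.compare(a.lang, b.lang);
    };

    // Proposed ordering: language tag (case insensitive) first, then lexical
    // form by locale collation, falling back to String.compareTo.
    static final Comparator<LangLiteral> PROPOSED = (a, b) -> {
        int c = String.CASE_INSENSITIVE_ORDER.compare(a.lang, b.lang);
        if (c != 0) return c;
        // Same tag ignoring case: collate by locale if the language is known.
        // A real implementation would cache the Collator per tag.
        Collator collator = collatorOrNull(a.lang);
        if (collator != null) c = collator.compare(a.lex, b.lex);
        if (c == 0) c = a.lex.compareTo(b.lex);     // unknown locale, or collation tie
        if (c == 0) c = a.lang.compareTo(b.lang);   // case-sensitive tag tie-break
        return c;
    };

    // Assumption: "unknown locale" means the primary language subtag is not a
    // two-letter ISO 639 code known to the JDK.
    static Collator collatorOrNull(String langTag) {
        Locale locale = Locale.forLanguageTag(langTag);
        boolean known = Arrays.asList(Locale.getISOLanguages()).contains(locale.getLanguage());
        return known ? Collator.getInstance(locale) : null;
    }
}
```

Under this sketch, sorting a list with {{PROPOSED}} groups all literals of one language tag together, so an {{@en}} literal always sorts before an {{@fi}} literal regardless of lexical form, whereas {{CURRENT}} interleaves languages.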
What about comparison, i.e. "{{<}}" in {{NodeValue.compare}}? That case is different
because a comparison may be undefined, whereas sorting is always defined, even
if the resulting order is arbitrary.
The proposal is to change {{NodeUtils.compareLiteralsBySyntax}}, but it looks
better to me to change {{NodeValue.compare(NodeValue nv1, NodeValue nv2,
boolean sortOrderingCompare)}} for the {{VSPACE_LANG}} case, when the language tags
denote the same language.
i.e. treat language tags as defining a value space which is ordered by locale
collation.
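The distinction between comparison and sorting, combined with the value-space idea, might look like the following sketch. It is not the actual {{NodeValue.compare}} code: {{compareLangStrings}} is a hypothetical helper that only borrows the {{sortOrderingCompare}} flag name, and an empty {{OptionalInt}} stands in for "comparison undefined". It also assumes that "same language" means the tags are equal ignoring case.

```java
import java.text.Collator;
import java.util.Locale;
import java.util.OptionalInt;

public class LangValueCompare {

    /**
     * Hypothetical helper, not Jena's API. Compares two language-tagged strings.
     * With sortOrderingCompare == false (a "<" style comparison) the result is
     * empty when the language tags differ, i.e. the comparison is undefined.
     * With sortOrderingCompare == true a total order is always produced.
     */
    static OptionalInt compareLangStrings(String lex1, String lang1,
                                          String lex2, String lang2,
                                          boolean sortOrderingCompare) {
        if (lang1.equalsIgnoreCase(lang2)) {
            // Same language: the tag defines a value space ordered by locale collation.
            Collator collator = Collator.getInstance(Locale.forLanguageTag(lang1));
            int c = collator.compare(lex1, lex2);
            if (c != 0 || !sortOrderingCompare)
                return OptionalInt.of(Integer.signum(c));
            // Sorting only: break collation ties so the order stays total.
            c = lex1.compareTo(lex2);
            if (c == 0) c = lang1.compareTo(lang2);
            return OptionalInt.of(Integer.signum(c));
        }
        if (!sortOrderingCompare)
            return OptionalInt.empty();   // different languages: "<" is undefined
        // Sorting across languages: order by tag, case insensitively.
        int c = String.CASE_INSENSITIVE_ORDER.compare(lang1, lang2);
        return OptionalInt.of(Integer.signum(c));
    }
}
```

Keeping the locale-aware step inside the same-language branch means literals with different language tags are never collated against each other, which is the point of treating each language as its own value space.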
Beware of cross language tag instability: [see this
message|http://markmail.org/message/ig4w7wqkxsssgqdt]. That is not to say don't
do it, just beware of nasty corner cases.
Treating literals as values within the same locale should get round this.
Overall, I would prefer a switch that puts ARQ into "locale-mode".
> Language-specific collation in ARQ
> ----------------------------------
>
> Key: JENA-1313
> URL: https://issues.apache.org/jira/browse/JENA-1313
> Project: Apache Jena
> Issue Type: Improvement
> Components: ARQ
> Affects Versions: Jena 3.2.0
> Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users
> mailing list in October 2016, I would like to change ARQ collation of literal
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
> method.
> It currently sorts by lexical value first, then by language tag. Since the
> collation order needs to be stable across all possible literal values, I
> think the safest way would be to sort by language tag first, then by lexical
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different
> collation rules than the main language? It would be a bit strange if all
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in
> implementing it.