[
https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950983#comment-15950983
]
Bruno P. Kinoshita commented on JENA-1313:
------------------------------------------
Hi Osma,
>> return values in es, then pt, then fi
>Shouldn't this be es, fi, pt?
Oops, my mistake. I fixed my comment and am also attaching a screenshot
(collate-result-no-filter-fullpage.png) with the output in Fuseki.
>Dydra also has a test suite for collation that is published using the
>Unlicense i.e. placed in the public domain. There I found an example of a
>proper Danish language collation sequence that could perhaps also be used as a
>test case. There are other test cases in that directory that may also be
>relevant.
I used a few words from the Dydra Danish test case:
- Broager
- Brædstrup
- Børkop
- Wandsbek
- Ærøskøbing
- Åkirkeby
Jena switches the last two (Åkirkeby sorts before Ærøskøbing). ICU gives me the
same order as the Dydra test for both of the Danish collation schemes available.
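
For anyone who wants to reproduce the comparison, here is a minimal sketch. It
uses the JDK's built-in java.text.Collator rather than ICU, which is an
assumption made only to keep the example dependency-free; its Danish rules may
not match ICU's in every detail. The class and method names are made up for
illustration. The switch seen in Jena is what plain code unit order produces,
since Å (U+00C5) precedes Æ (U+00C6).

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of Jena or of any patch for JENA-1313.
class DanishCollationDemo {
    // Place names from the Dydra Danish test case, deliberately shuffled.
    // Unicode escapes keep the source independent of the file encoding:
    // U+00C5 = A with ring, U+00C6 = AE ligature,
    // U+00E6 = small ae ligature, U+00F8 = small o with stroke.
    static final List<String> WORDS = Arrays.asList(
        "\u00C5kirkeby", "Wandsbek", "B\u00F8rkop",
        "\u00C6r\u00F8sk\u00F8bing", "Broager", "Br\u00E6dstrup");

    // Plain String ordering: compares UTF-16 code units, no language rules.
    static List<String> sortByCodeUnit(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }

    // Locale-aware ordering using the collator the JDK provides for 'locale'.
    static List<String> sortWithCollator(List<String> input, Locale locale) {
        List<String> copy = new ArrayList<>(input);
        Collator collator = Collator.getInstance(locale);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println("code unit order: " + sortByCodeUnit(WORDS));
        System.out.println("Danish collator: "
            + sortWithCollator(WORDS, Locale.forLanguageTag("da")));
    }
}
```

Comparing the two printed lists against the Dydra order above shows whether the
collator in use agrees with it.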
>For the extension function, I suggest defining it just as a single function
>e.g. collate:collate that takes up to two parameters: the literal value and
>the language/locale (which may be omitted, and then the language is extracted
>from the language tag)
Good points. Would we also need a third argument to define the sort order as
ASC or DESC? Or would we use ASC(collate:collate(?label, 'fi'))?
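
To make the two-parameter idea concrete, here is a rough sketch of the locale
resolution it implies: an explicit locale argument wins, otherwise the
literal's language tag is used. All names here (CollateSketch, resolveLocale,
comparatorFor) are hypothetical, and it uses java.text.Collator only to stay
self-contained; it is not the Jena function API. Direction is left to the
caller, which would correspond to wrapping the call in ASC(...) or DESC(...)
rather than adding a third argument.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

// Hypothetical sketch, not the actual collate:collate implementation.
class CollateSketch {
    // Pick the collation locale: an explicit locale argument wins,
    // otherwise fall back to the literal's own language tag.
    // With neither present, use the root locale.
    static Locale resolveLocale(String langTag, String localeArg) {
        String tag = (localeArg != null && !localeArg.isEmpty()) ? localeArg : langTag;
        return (tag == null || tag.isEmpty()) ? Locale.ROOT : Locale.forLanguageTag(tag);
    }

    // Comparator over lexical forms for the resolved locale.
    static Comparator<String> comparatorFor(String langTag, String localeArg) {
        Collator collator = Collator.getInstance(resolveLocale(langTag, localeArg));
        return collator::compare;
    }

    public static void main(String[] args) {
        // Locale omitted: taken from the language tag of the literal.
        Comparator<String> asc = comparatorFor("fi", null);
        // Descending order is just the reversed comparator.
        Comparator<String> desc = asc.reversed();
        System.out.println(asc.compare("apple", "banana") < 0);
        System.out.println(desc.compare("apple", "banana") > 0);
    }
}
```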
> Language-specific collation in ARQ
> ----------------------------------
>
> Key: JENA-1313
> URL: https://issues.apache.org/jira/browse/JENA-1313
> Project: Apache Jena
> Issue Type: Improvement
> Components: ARQ
> Affects Versions: Jena 3.2.0
> Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users
> mailing list in October 2016, I would like to change ARQ collation of literal
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
> method.
> It currently sorts by lexical value first, then by language tag. Since the
> collation order needs to be stable across all possible literal values, I
> think the safest way would be to sort by language tag first, then by lexical
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different
> collation rules than the main language? It would be a bit strange if all
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in
> implementing it.
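
The ordering proposed in the issue description (language tag first, then
lexical value under that language's collation rules) could be sketched roughly
as below. The Lit class is a hypothetical stand-in for an RDF literal, and
java.text.Collator is used only to keep the example self-contained; this is
not the NodeUtils code. It also illustrates the subtag concern: because "en"
is a prefix of "en-US", every @en-US literal ends up after every @en literal.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of "language tag first, then language-specific collation".
class LangFirstOrdering {
    // Minimal stand-in for a language-tagged literal (not a Jena class).
    static final class Lit {
        final String lexical;
        final String lang;
        Lit(String lexical, String lang) {
            this.lexical = lexical;
            this.lang = lang;
        }
        @Override
        public String toString() {
            return "\"" + lexical + "\"@" + lang;
        }
    }

    static int compare(Lit a, Lit b) {
        // 1. Language tags are case-insensitive, so compare them that way.
        int byLang = a.lang.compareToIgnoreCase(b.lang);
        if (byLang != 0) return byLang;
        // 2. Same tag: use that language's collation rules.
        //    (A real implementation would cache the collator per tag.)
        Collator collator = Collator.getInstance(Locale.forLanguageTag(a.lang));
        int byCollation = collator.compare(a.lexical, b.lexical);
        // 3. Tie-break on the raw lexical form so that strings the collator
        //    considers equal still get a stable, deterministic order.
        return byCollation != 0 ? byCollation : a.lexical.compareTo(b.lexical);
    }

    static List<Lit> sort(List<Lit> input) {
        List<Lit> copy = new ArrayList<>(input);
        copy.sort(LangFirstOrdering::compare);
        return copy;
    }

    public static void main(String[] args) {
        List<Lit> input = new ArrayList<>();
        input.add(new Lit("zebra", "en"));
        input.add(new Lit("apple", "en-US"));
        input.add(new Lit("banana", "en"));
        // prints ["banana"@en, "zebra"@en, "apple"@en-US]
        System.out.println(sort(input));
    }
}
```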