[jira] [Commented] (JENA-1313) Language-specific collation in ARQ

Osma Suominen (JIRA) Thu, 30 Mar 2017 06:29:58 -0700

    [ https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949057#comment-15949057 ]


Osma Suominen commented on JENA-1313:
-------------------------------------

Thank you everyone for the input. Sorry for not being specific enough.

My initial proposal, which still stands, is to change 
NodeUtils.compareLiteralsBySyntax so that it uses a different sort priority 
for plain and language-tagged literals than it does now:
1. If the language tags differ, order by language tag (lexical comparison of 
the tags). Literals without a language tag (including {{xsd:string}}) can go 
first (or last).
2. When the language tags are the same, sort the lexical values according to 
the collation rules of the language identified by the language tag.

In my understanding, this would fulfill the "deterministic, stable, arbitrary 
ordering between unrelated literals" contract, just in a different way than the 
current implementation that orders first by lexical value, then by language tag.
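
A minimal sketch of the two-step ordering described above, in plain Java. It 
is not the actual NodeUtils code: the {{Lit}} class and the comparator name are 
hypothetical stand-ins for illustration, and only java.text.Collator and 
java.util.Locale from the JDK are used. It assumes untagged literals carry an 
empty tag and sort first, and that tags are compared case-insensitively.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class LangAwareOrder {
    /** Hypothetical stand-in for a plain or language-tagged literal (not a Jena class). */
    static final class Lit {
        final String lexical;
        final String lang; // "" when the literal has no language tag

        Lit(String lexical, String lang) {
            this.lexical = lexical;
            this.lang = lang;
        }

        @Override
        public String toString() {
            return lang.isEmpty() ? lexical : lexical + "@" + lang;
        }
    }

    static final Comparator<Lit> BY_LANG_THEN_COLLATION = (a, b) -> {
        // Step 1: compare language tags lexically; the empty tag sorts first.
        String la = a.lang.toLowerCase(Locale.ROOT);
        String lb = b.lang.toLowerCase(Locale.ROOT);
        int c = la.compareTo(lb);
        if (c != 0) {
            return c;
        }
        // Step 2: same tag, so use that language's collation rules.
        // (A real implementation would cache one Collator per tag.)
        Locale locale = la.isEmpty() ? Locale.ROOT : Locale.forLanguageTag(la);
        c = Collator.getInstance(locale).compare(a.lexical, b.lexical);
        if (c != 0) {
            return c;
        }
        // Tie-break on code units so strings the collator treats as equal
        // still get a deterministic order.
        return a.lexical.compareTo(b.lexical);
    };

    /** Returns the literals in sorted order, rendered as strings. */
    static List<String> sorted(List<Lit> input) {
        List<Lit> copy = new ArrayList<>(input);
        copy.sort(BY_LANG_THEN_COLLATION);
        List<String> out = new ArrayList<>();
        for (Lit l : copy) {
            out.add(l.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sorted(Arrays.asList(
                new Lit("zebra", "en"), new Lit("B", "en"), new Lit("a", "en"),
                new Lit("zeppelin", "fi"), new Lit("plain", ""))));
    }
}
```

Note that this sketch treats {{en}} and {{en-US}} as unrelated tags, so the 
subtag question raised in the issue description is left open.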

But if this is too disruptive, or otherwise has undesirable effects, then I 
think that a {{collate:collate}} function would be just as good. It would 
extract the language tag from the value (or perhaps take it as an additional 
parameter) and map the value to integers (or strings) that can be sorted in 
the usual way. Thus it could be used in an ORDER BY clause, as in Rob's 
example.
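
To illustrate the "map to strings that sort in the usual way" idea, here is a 
rough sketch using the JDK's collation keys. The {{collate}} helper below is 
hypothetical and takes the language tag as an explicit parameter; how such a 
function would be registered with ARQ is not shown. It relies on the 
documented property that the byte arrays of two CollationKeys from the same 
Collator compare the same way as the keys themselves.

```java
import java.text.Collator;
import java.util.Locale;

public class CollateKey {
    /**
     * Hypothetical helper: maps a lexical value to a hex string whose plain
     * string order matches the collation order of the given language.
     */
    static String collate(String lexical, String langTag) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(langTag));
        byte[] key = collator.getCollationKey(lexical).toByteArray();
        StringBuilder sb = new StringBuilder(key.length * 2);
        for (byte b : key) {
            // Fixed-width hex keeps unsigned byte order under string comparison.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Code-unit order puts "B" before "a"; the collation keys do not.
        System.out.println("a".compareTo("B") > 0);
        System.out.println(collate("a", "en").compareTo(collate("B", "en")) < 0);
    }
}
```

Keys produced for different languages are not meaningfully comparable with 
each other, so a query would still need to group or order by language tag 
first if it mixes languages.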

> Language-specific collation in ARQ
> ----------------------------------
>
>                 Key: JENA-1313
>                 URL: https://issues.apache.org/jira/browse/JENA-1313
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.2.0
>            Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users 
> mailing list in October 2016, I would like to change ARQ collation of literal 
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the 
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
>  method.
> It currently sorts by lexical value first, then by language tag. Since the 
> collation order needs to be stable across all possible literal values, I 
> think the safest way would be to sort by language tag first, then by lexical 
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different 
> collation rules than the main language? It would be a bit strange if all 
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same 
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in 
> implementing it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
