[jira] [Comment Edited] (JENA-1313) Language-specific collation in ARQ

Andy Seaborne (JIRA) Thu, 30 Mar 2017 06:17:29 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949031#comment-15949031
 ]


Andy Seaborne edited comment on JENA-1313 at 3/30/17 1:16 PM:
--------------------------------------------------------------

[~osma] but what is the proposal? 

As per [the message of Oct 25, 
2016|http://markmail.org/message/ig4w7wqkxsssgqdt], the "make it respect 
language-specific collation rules" is a general direction which leaves open 
problems that need addressing.

Possibilities include
1. making it work for one language - put the system into "@fi" mode whereby 
all comparisons (sort and {{<}}) are done by @someLang rules for xsd:string and 
@someLang literals.  Good for Fuseki.
1. allowing a plug-in comparison function - i.e. in the Context - so the 
default is the current collation-stable code.   Hence a per-query setting. Can't 
work with Fuseki as it is today.
1. A library of collation functions {{ORDER BY collate:collate(?value)}} and 
{{collate:compare(?value1,?value2)}} -  [~rvesse]'s example.

It can be done now with functions {{collate:....}} that map the value onto 
the integers. This will give a chance to explore the design space.
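
As a rough illustration of the "map the value onto something sortable" idea, here is a minimal sketch using the JDK's {{java.text.Collator}} and {{CollationKey}}. The class and method names ({{CollateSketch}}, {{collatorFor}}, {{collate}}, {{sortBy}}) are made up for this example and are not Jena or ARQ APIs; a real {{collate:}} function would also need to be registered with ARQ, which is not shown.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Jena or ARQ.
final class CollateSketch {

    // Collator for a BCP 47 language tag such as "fi" or "en-US".
    static Collator collatorFor(String langTag) {
        return Collator.getInstance(Locale.forLanguageTag(langTag));
    }

    // Maps a value to a sort key. Keys produced by the same collator
    // compare the same way that collator compares the original strings.
    static CollationKey collate(Collator collator, String value) {
        return collator.getCollationKey(value);
    }

    // Returns a sorted copy of the values, ordered by their collation keys.
    static List<String> sortBy(String langTag, List<String> values) {
        Collator collator = collatorFor(langTag);
        List<String> sorted = new ArrayList<>(values);
        sorted.sort((a, b) -> collate(collator, a).compareTo(collate(collator, b)));
        return sorted;
    }

    public static void main(String[] args) {
        // Collation order differs from code point order: "a" sorts before "B".
        System.out.println(sortBy("en", Arrays.asList("B", "a", "c")));
        // The same call with another tag applies that language's rules.
        System.out.println(sortBy("fi", Arrays.asList("z", "ä", "a")));
    }
}
```

The language tag is an explicit argument here, which is the point of option 3: the choice of collation lives in the query, not in a server-wide setting.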

{{xsd:string}} is difficult - how one client of Fuseki wants the answers may be 
different to how another client wants them if they are from different parts of 
the world and hence different local languages. This is a very real issue for me.

What do other multi-language systems like XQuery do? What about other triple 
stores (no need to be different just because we didn't look!)?

[NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
Note the contract "Gives a deterministic, stable, arbitrary ordering between 
unrelated literals." An unstable ordering will cause queries to crash - the 
Java sort code will throw exceptions on unstable comparators.
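
To make the stability requirement concrete, here is a hedged sketch of a comparator that stays a consistent ordering while using language rules: language tag first (as suggested in the issue description), then the collator for that tag, then plain string comparison as a tie-breaker. {{Lit}} is a stand-in for an RDF literal and {{LangLiteralOrderSketch}} is a made-up name; this is not the code in {{NodeUtils}}. It also assumes tags are compared as written, with no case normalisation and no special handling of subtags.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not the NodeUtils implementation.
final class LangLiteralOrderSketch {

    // Minimal stand-in for a literal: lexical form plus language tag ("" for none).
    static final class Lit {
        final String lex;
        final String lang;

        Lit(String lex, String lang) {
            this.lex = lex;
            this.lang = lang;
        }

        @Override
        public String toString() {
            return lex + "@" + lang;
        }
    }

    static final Comparator<Lit> ORDER = (a, b) -> {
        // 1. Group by language tag so each collator only sees its own language.
        int c = a.lang.compareTo(b.lang);
        if (c != 0) return c;
        // 2. Within one language, use that language's collation rules.
        //    (Real code would cache collators instead of creating one per call.)
        Collator collator = Collator.getInstance(Locale.forLanguageTag(a.lang));
        c = collator.compare(a.lex, b.lex);
        if (c != 0) return c;
        // 3. Tie-breaker: strings the collator treats as equal still get a fixed order.
        return a.lex.compareTo(b.lex);
    };

    // Returns a sorted copy of the input.
    static List<Lit> sort(List<Lit> input) {
        List<Lit> sorted = new ArrayList<>(input);
        sorted.sort(ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<Lit> lits = new ArrayList<>();
        lits.add(new Lit("b", "en"));
        lits.add(new Lit("a", "fi"));
        lits.add(new Lit("a", "en"));
        System.out.println(sort(lits));
    }
}
```

Each step only breaks ties left by the previous one, so the result does not depend on input order. Mixing collators across language tags inside a single comparison is where the risk of an inconsistent comparator comes from.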



> Language-specific collation in ARQ
> ----------------------------------
>
>                 Key: JENA-1313
>                 URL: https://issues.apache.org/jira/browse/JENA-1313
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.2.0
>            Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users 
> mailing list in October 2016, I would like to change ARQ collation of literal 
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the 
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
>  method.
> It currently sorts by lexical value first, then by language tag. Since the 
> collation order needs to be stable across all possible literal values, I 
> think the safest way would be to sort by language tag first, then by lexical 
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different 
> collation rules than the main language? It would be a bit strange if all 
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same 
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in 
> implementing it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
