[jira] [Commented] (JENA-1313) Language-specific collation in ARQ

Osma Suominen (JIRA) Mon, 03 Apr 2017 06:47:15 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953502#comment-15953502
 ]


Osma Suominen commented on JENA-1313:
-------------------------------------

{quote}
I have no idea how collate:collate(?value, ?lang) would work except it is 
almost certain to have instability problems comparing across languages in a 
single sort. See my example in the message of Oct 25, 2016. Java "sort" can 
throw an exception because the comparator contract is broken.
{quote}

My perhaps naive inclination would be to implement this function like this:
1. Create a Collator using the supplied {{?lang}} as the locale
2. Using the Collator, convert the {{?value}} to a byte string (its collation key)

Even when called several times with different {{?lang}} parameters, this should 
still be deterministic and produce a set of byte strings that can be compared. 
Now the order between byte strings generated using different {{?lang}} values 
may not make sense to a human, but it should still be consistent, since in the 
end we are comparing the byte strings and they are conceptually just large 
numbers which have a well-defined order. So no contracts should be broken nor 
exceptions thrown.
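
A minimal sketch of the two steps above, assuming the standard {{java.text.Collator}} and {{CollationKey.toByteArray()}} are used. The class {{CollateSketch}} and its methods are hypothetical names for illustration only, not existing ARQ API, and treating {{?lang}} as a BCP 47 tag is also an assumption.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class CollateSketch {

    /**
     * Hypothetical helper: steps 1 and 2 from the list above.
     * Assumes langTag is a BCP 47 language tag such as "fi" or "en-US".
     */
    static byte[] collationBytes(String langTag, String value) {
        // Step 1: create a Collator using the supplied language as the locale.
        Collator collator = Collator.getInstance(Locale.forLanguageTag(langTag));
        // Step 2: convert the value to a byte string via its collation key.
        CollationKey key = collator.getCollationKey(value);
        return key.toByteArray();
    }

    /**
     * Compares two byte strings as unsigned bytes, left to right.
     * This is a total order on byte arrays, so it is a consistent
     * comparator even when the keys came from different locales.
     */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        // If one array is a prefix of the other, the shorter one sorts first.
        return Integer.compare(a.length, b.length);
    }
}
```

Within a single language the byte order follows that language's collation rules. Across languages the result is only guaranteed to be consistent, not meaningful, which matches the caveat above.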

By the way, the parameters could just as well be swapped, i.e. 
{{collate:collate(?lang, ?value)}}, to make it more explicit that the language 
is more "fundamental" here.

> Language-specific collation in ARQ
> ----------------------------------
>
>                 Key: JENA-1313
>                 URL: https://issues.apache.org/jira/browse/JENA-1313
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.2.0
>            Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users 
> mailing list in October 2016, I would like to change ARQ collation of literal 
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the 
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
>  method.
> It currently sorts by lexical value first, then by language tag. Since the 
> collation order needs to be stable across all possible literal values, I 
> think the safest way would be to sort by language tag first, then by lexical 
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different 
> collation rules than the main language? It would be a bit strange if all 
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same 
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in 
> implementing it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
