[jira] [Commented] (JENA-1313) Language-specific collation in ARQ

Bruno P. Kinoshita (JIRA) Wed, 12 Apr 2017 05:58:00 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965786#comment-15965786
 ]


Bruno P. Kinoshita commented on JENA-1313:
------------------------------------------

Thanks for sharing the test case, Osma.

I had some spare time today, so I decided to give it a try. After some 
debugging, I think I understand what Andy is saying about the values for the 
collate:collate function. I started writing an ARQ function (extending 
FunctionBase1, then later FunctionBase2), but didn't end up using it.

When the function call is passed as an expression to ORDER BY, it becomes (as 
far as I could tell) a SortCondition, which wraps the expression. Later, it is 
invoked in BindingComparator#compare, where the expression is applied to each 
value in the comparison pair.

Changing the BindingComparator would require changing the NodeValue or 
NodeUtils... so I couldn't find a clear way to write the function yet.
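
To make the problem concrete, here is a plain-Java analogy of the evaluation 
described above. It is not Jena code: byExpression is a made-up name standing 
in for the SortCondition/BindingComparator pair, and the values are plain 
strings rather than NodeValues. The expression only produces a value for each 
side; how two such values are ordered is decided elsewhere, which is why a 
collation function alone would not be enough.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

public class SortConditionSketch {

    // Hypothetical stand-in for a sort condition: the wrapped expression is
    // applied to each side of the pair, and the results are then compared
    // with their built-in ordering (here String#compareTo).
    static <T> Comparator<T> byExpression(Function<T, String> expr) {
        return (a, b) -> expr.apply(a).compareTo(expr.apply(b));
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<>(List.of("b", "C", "a"));
        // The expression can change the value being compared...
        values.sort(byExpression((String s) -> s.toLowerCase(Locale.ROOT)));
        // ...but not the comparison rule applied to the results.
        System.out.println(values); // prints [a, b, C]
    }
}
```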

But then, after reading the Dydra post again, I decided to follow their 
approach and change the default behaviour. Here's the 
[PR|https://github.com/apache/jena/pull/237].

That would change the behaviour only when comparing literals that have the 
same language tag. Here's the output of the Finnish words from a query in 
Fuseki:

{noformat}
* "tsahurin kieli"@fi
* "tšekin kieli"@fi
* "tšekin kieli"@fi
* "tulun kieli"@fi
* "töyhtöhyyppä"@fi
* "töyhtöhyyppä"@fi
{noformat}

And the Danish output (matches ICU and Dydra output):

{noformat}
* "Broager"@da
* "Brædstrup"@da
* "Børkop"@da
* "Wandsbek"@da
* "Ærøskøbing"@da
* "Åkirkeby"@da
{noformat}
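
For reference, a minimal sketch of language-aware ordering using only 
java.text.Collator. It is not the code from the PR: LangLiteral and 
TAG_THEN_COLLATION are hypothetical names, and the sketch orders by language 
tag first (as suggested in the issue description) so that the comparator 
stays a consistent total order. The actual result for a given language 
depends on the collation data shipped with the JDK in use.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class LangCollationSketch {

    /** Hypothetical stand-in for a language-tagged literal; not a Jena class. */
    static final class LangLiteral {
        final String lexical;
        final String lang;

        LangLiteral(String lexical, String lang) {
            this.lexical = lexical;
            this.lang = lang;
        }

        @Override
        public String toString() {
            return "\"" + lexical + "\"@" + lang;
        }
    }

    // Order by language tag first, then by the collation rules of that
    // language. A code-point comparison breaks ties so that only identical
    // lexical forms compare as equal.
    static final Comparator<LangLiteral> TAG_THEN_COLLATION = (a, b) -> {
        String tagA = a.lang.toLowerCase(Locale.ROOT);
        String tagB = b.lang.toLowerCase(Locale.ROOT);
        int byTag = tagA.compareTo(tagB);
        if (byTag != 0) {
            return byTag;
        }
        // Real code would probably cache one Collator per tag.
        Collator collator = Collator.getInstance(Locale.forLanguageTag(tagA));
        int byCollation = collator.compare(a.lexical, b.lexical);
        return byCollation != 0 ? byCollation : a.lexical.compareTo(b.lexical);
    };

    public static void main(String[] args) {
        // The Danish test words, written with Unicode escapes so the source
        // file stays ASCII-only.
        String[] words = {
            "\u00C5kirkeby", "Wandsbek", "Broager",
            "\u00C6r\u00F8sk\u00F8bing", "B\u00F8rkop", "Br\u00E6dstrup"
        };
        List<LangLiteral> towns = new ArrayList<>();
        for (String w : words) {
            towns.add(new LangLiteral(w, "da"));
        }
        towns.sort(TAG_THEN_COLLATION);
        towns.forEach(System.out::println);
    }
}
```

Because the tag is compared before the lexical form, all @da literals sort 
before all @fi literals regardless of their text, which is the trade-off 
raised in the issue description.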

This doesn't provide a function that could be used to force a collation when 
none is specified, or to override the collation locale, but maybe it could 
still be part of the fix?

> Language-specific collation in ARQ
> ----------------------------------
>
>                 Key: JENA-1313
>                 URL: https://issues.apache.org/jira/browse/JENA-1313
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.2.0
>            Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users 
> mailing list in October 2016, I would like to change ARQ collation of literal 
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the 
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
>  method.
> It currently sorts by lexical value first, then by language tag. Since the 
> collation order needs to be stable across all possible literal values, I 
> think the safest way would be to sort by language tag first, then by lexical 
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different 
> collation rules than the main language? It would be a bit strange if all 
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same 
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in 
> implementing it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
