[
https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996521#comment-15996521
]
ASF GitHub Bot commented on JENA-1313:
--------------------------------------
Github user kinow commented on the issue:
https://github.com/apache/jena/pull/237
> Doesn't this run into the unstable sort issue that @afs cautioned against?
Not sure. I don't think this approach causes it, but I tried to find out
whether the sort could be unstable, and I think I found one case.
> I think it could be avoided by the following logic: If two
`NodeValueSortKey`s have different collation languages, sort them by the
collation languages instead of even looking at the text.
>
>This is the (lang, lex) approach discussed earlier, just applied in a
slightly different context.
Sounds like a plan. Let's wait and see what others think.
Now, on stability...
I tried to find ways in which the sort would be unstable, but for two values
A and B with the same collation, the result would be stable. For two values C
and D with different collations, or missing collations, the values would be
sorted by the string literal. The node produced would be a `Node_Literal` (the
function rewrites any node given to it as a `Node_Literal`).
Now here is the interesting part. `#equals(Object)` and `#hashCode()` use
the node value, i.e. the `Node_Literal` string, to compare values. Using the
approach suggested by @osma, `NodeValue#compare(NodeValue, NodeValue)` for
`NodeValueSortKey`("Casa", "es") and `NodeValueSortKey`("Casa", "pt") would
return that `NodeValueSortKey`("Casa", "es") < `NodeValueSortKey`("Casa",
"pt"), i.e. since the two values have different collation language tags, we
would compare "es" and "pt".
However, `#equals(Object)` and `#hashCode()` would treat the two values as
equal, because they look only at the `Node_Literal` node. So
`NodeValueSortKey`("Casa", "es").equals( `NodeValueSortKey`("Casa", "pt") )
would return true.
I believe this could cause problems: the merge sort would be stable (I
think), but using the elements (sorted or not) in a map/set could result in
weird behaviour...
Some code to illustrate the above:
```
NodeValueSortKey nvsk1 = new NodeValueSortKey("Casa", "es");
NodeValueSortKey nvsk2 = new NodeValueSortKey("Casa", "pt");
System.out.println(nvsk1.equals(nvsk2));
// true
NodeValueLang nvl1 = new NodeValueLang("Casa", "es");
NodeValueLang nvl2 = new NodeValueLang("Casa", "pt");
System.out.println(nvl1.equals(nvl2));
// false
```
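To make the map/set concern concrete, here is a minimal sketch using a hypothetical stand-in class (`LexOnlyKey` is an illustration, not Jena's `NodeValueSortKey`). It assumes a comparison that looks at the language tag while equals/hashCode look only at the text, and shows a hash-based set and a sorted set disagreeing about the same two keys:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a sort key; NOT Jena's NodeValueSortKey.
final class LexOnlyKey implements Comparable<LexOnlyKey> {
    final String lex;
    final String lang;

    LexOnlyKey(String lex, String lang) {
        this.lex = lex;
        this.lang = lang;
    }

    // Ordering compares the language tag first, then the text.
    @Override
    public int compareTo(LexOnlyKey other) {
        int byLang = lang.compareTo(other.lang);
        return byLang != 0 ? byLang : lex.compareTo(other.lex);
    }

    // Equality deliberately ignores the language tag (the assumed problem case).
    @Override
    public boolean equals(Object o) {
        return o instanceof LexOnlyKey && lex.equals(((LexOnlyKey) o).lex);
    }

    @Override
    public int hashCode() {
        return lex.hashCode();
    }
}

class InconsistentKeyDemo {
    public static void main(String[] args) {
        LexOnlyKey es = new LexOnlyKey("Casa", "es");
        LexOnlyKey pt = new LexOnlyKey("Casa", "pt");

        Set<LexOnlyKey> hashed = new HashSet<>();
        hashed.add(es);
        hashed.add(pt);

        Set<LexOnlyKey> sorted = new TreeSet<>();
        sorted.add(es);
        sorted.add(pt);

        System.out.println(hashed.size()); // 1: equals/hashCode see one key
        System.out.println(sorted.size()); // 2: compareTo sees two keys
    }
}
```

The two collections hold different numbers of elements for the same input, which is the kind of weird behaviour I mean.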
For `NodeValueLang`s, when a `Node_Literal` is created, it is given a
`LiteralLabel` object that it wraps. Then, when you call
`NodeValueLang#equals(Object)`, `NodeValueLang` uses
`LiteralLabel#equals(Object)` to compare against the other `NodeValueLang`,
and `LiteralLabel` checks the language tag.
I wonder if we should create a new `Node_Concrete` implementation in
`org.apache.jena.graph` (Node_SortKey?), or if we should modify
`Node_Literal`... I feel like the latter would be less elegant than the former.
By using the current implementation, plus @osma's suggestion of comparing
the collation language tag, and finally by making sure equals/hashCode agree
with what our comparison says, I believe we would have a stable sort.
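A rough sketch of what "equals/hashCode agree with the comparison" could look like, again with a hypothetical class (`LangSortKey` is not part of Jena, and treating a missing tag as the empty string is my assumption). It orders by language tag first, then by `java.text.Collator` for that language, with a raw-string tie-break so that the comparison returns 0 only when `equals` is true:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Objects;

// Hypothetical sketch; NOT Jena's NodeValueSortKey.
final class LangSortKey implements Comparable<LangSortKey> {
    private final String lex;
    private final String lang; // collation language tag; "" when missing (assumption)

    LangSortKey(String lex, String lang) {
        this.lex = Objects.requireNonNull(lex);
        this.lang = lang == null ? "" : lang;
    }

    @Override
    public int compareTo(LangSortKey other) {
        // Different collation languages: order by tag, never look at the text.
        int byLang = lang.compareTo(other.lang);
        if (byLang != 0) {
            return byLang;
        }
        // Same language: apply that language's collation rules.
        Collator collator = Collator.getInstance(Locale.forLanguageTag(lang));
        int byCollation = collator.compare(lex, other.lex);
        if (byCollation != 0) {
            return byCollation;
        }
        // Tie-break on the raw string so 0 is returned only for equal keys.
        return lex.compareTo(other.lex);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof LangSortKey)) {
            return false;
        }
        LangSortKey other = (LangSortKey) o;
        return lex.equals(other.lex) && lang.equals(other.lang);
    }

    @Override
    public int hashCode() {
        return Objects.hash(lex, lang);
    }

    @Override
    public String toString() {
        return "\"" + lex + "\"@" + lang;
    }
}

class LangSortKeyDemo {
    public static void main(String[] args) {
        List<LangSortKey> keys = new ArrayList<>();
        keys.add(new LangSortKey("Casa", "pt"));
        keys.add(new LangSortKey("Casa", "es"));
        keys.add(new LangSortKey("Arbol", "es"));
        Collections.sort(keys);
        System.out.println(keys);
    }
}
```

With this shape, `("Casa", "es")` and `("Casa", "pt")` are unequal under both `equals` and `compareTo`, so hash-based and sorted collections should see the same set of keys.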
Thoughts?
> Language-specific collation in ARQ
> ----------------------------------
>
> Key: JENA-1313
> URL: https://issues.apache.org/jira/browse/JENA-1313
> Project: Apache Jena
> Issue Type: Improvement
> Components: ARQ
> Affects Versions: Jena 3.2.0
> Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users
> mailing list in October 2016, I would like to change ARQ collation of literal
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
> method.
> It currently sorts by lexical value first, then by language tag. Since the
> collation order needs to be stable across all possible literal values, I
> think the safest way would be to sort by language tag first, then by lexical
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different
> collation rules than the main language? It would be a bit strange if all
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in
> implementing it.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)