[jira] [Commented] (JENA-1313) Language-specific collation in ARQ

ASF GitHub Bot (JIRA) Thu, 04 May 2017 03:36:08 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996521#comment-15996521
 ]


ASF GitHub Bot commented on JENA-1313:
--------------------------------------

Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    > Doesn't this run into the unstable sort issue that @afs cautioned against?
    
    Not sure. I don't think this approach makes the sort unstable, but I looked 
for cases where it could be, and I think I found one.
    
    > I think it could be avoided by the following logic: If two 
`NodeValueSortKey`s have different collation languages, sort them by the 
collation languages instead of even looking at the text.
    >
    >This is the (lang, lex) approach discussed earlier, just applied in a 
slightly different context.
    
    Sounds like a plan. Let's wait and see what others think.
    
    Now, on stability...
    
    I tried to find ways in which the sort would be unstable. For two values A 
and B with the same collation, the result would be stable. For two values C and 
D with different collations, or missing collations, the result would be sorted 
by the string literal. The node produced would be a `Node_Literal` (the 
function rewrites any node given to it as a `Node_Literal`).
    
    Now here is the interesting part. `#equals(Object)` and `#hashCode()` use 
the node value, i.e. the `Node_Literal` string, to compare values. Using the 
approach suggested by @osma, `NodeValue#compare(NodeValue, NodeValue)` for 
`NodeValueSortKey("Casa", "es")` and `NodeValueSortKey("Casa", "pt")` would 
return that `NodeValueSortKey("Casa", "es")` < `NodeValueSortKey("Casa", "pt")`. 
That is, since the two values have different collation language tags, we would 
compare "es" and "pt".
    
    However, `#equals(Object)` and `#hashCode()` would be based only on the 
`Node_Literal` node. So 
`NodeValueSortKey("Casa", "es").equals(NodeValueSortKey("Casa", "pt"))` 
would return true.
    
    I believe this could cause problems: the merge sort itself would be 
stable (I think), but using the elements (sorted or not) in a map or set could 
result in weird behaviour...
    
    Some code to illustrate the above:
    
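    ```
    // Sort keys: same text, different collation language tags.
    NodeValueSortKey nvsk1 = new NodeValueSortKey("Casa", "es");
    NodeValueSortKey nvsk2 = new NodeValueSortKey("Casa", "pt");
    System.out.println(nvsk1.equals(nvsk2));
    // true (the collation tag is ignored)
    
    // Language-tagged literals: same text, different language tags.
    NodeValueLang nvl1 = new NodeValueLang("Casa", "es");
    NodeValueLang nvl2 = new NodeValueLang("Casa", "pt");
    System.out.println(nvl1.equals(nvl2));
    // false (the language tag is compared)
    ```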
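    
    To make the map/set concern concrete, here is a minimal, self-contained 
    sketch that does not use Jena at all. `LooseKey` is a hypothetical stand-in 
    for a sort key whose `equals`/`hashCode` look only at the text, while the 
    comparator orders by collation tag first and text second. The class and 
    method names are made up for illustration only.
    
    ```java
    import java.util.Comparator;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.TreeSet;

    // Hypothetical demo class; not part of Jena.
    final class SortKeyMismatchDemo {

        /** Stand-in for a sort key whose equality ignores the collation tag. */
        static final class LooseKey {
            final String text;
            final String collation;

            LooseKey(String text, String collation) {
                this.text = text;
                this.collation = collation;
            }

            @Override
            public boolean equals(Object other) {
                // Only the text is compared, mirroring the behaviour described above.
                return other instanceof LooseKey && text.equals(((LooseKey) other).text);
            }

            @Override
            public int hashCode() {
                return text.hashCode();
            }
        }

        /** Ordering in the spirit of the suggestion: collation tag first, then text. */
        static final Comparator<LooseKey> BY_COLLATION_THEN_TEXT =
                Comparator.comparing((LooseKey k) -> k.collation).thenComparing(k -> k.text);

        /** Adds the two "Casa" keys to a hash-based set and reports its size. */
        static int hashSetSize() {
            Set<LooseKey> set = new HashSet<>();
            set.add(new LooseKey("Casa", "es"));
            set.add(new LooseKey("Casa", "pt"));
            return set.size();
        }

        /** Adds the same two keys to a comparator-based set and reports its size. */
        static int treeSetSize() {
            Set<LooseKey> set = new TreeSet<>(BY_COLLATION_THEN_TEXT);
            set.add(new LooseKey("Casa", "es"));
            set.add(new LooseKey("Casa", "pt"));
            return set.size();
        }

        public static void main(String[] args) {
            System.out.println("HashSet size: " + hashSetSize()); // 1
            System.out.println("TreeSet size: " + treeSetSize()); // 2
        }
    }
    ```
    
    The hash-based set treats the two keys as duplicates, while the 
    comparator-based set keeps both, which is the kind of disagreement I am 
    worried about.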
    
    For `NodeValueLang`s, when a `Node_Literal` is created, it is given a 
`LiteralLabel` object that it wraps. Then, when you call 
`NodeValueLang#equals(Object)`, `NodeValueLang` uses 
`LiteralLabel#equals(Object)` to compare itself with the other `NodeValueLang`. 
`LiteralLabel#equals(Object)` checks the language tag, which is why the two 
values above are not equal.
    
    I wonder if we should create a new `Node_Concrete` implementation in 
`org.apache.jena.graph` (Node_SortKey?), or if we should modify 
`Node_Literal`... I feel like the latter would be less elegant than the former.
    
    By using the current implementation, plus @osma's suggestion of comparing 
the collation language tag, and finally by making sure equals/hashCode agree 
with what the comparison says, I believe we would have a stable sort. 
Thoughts?
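    
    A rough sketch of what "equals/hashCode agree with the comparison" could 
    look like, again without Jena. `ConsistentSortKey` is a hypothetical class, 
    not the PR's `NodeValueSortKey`. It assumes both the text and the collation 
    tag are always present (the missing-collation case is not handled), and it 
    uses `java.text.Collator` as a stand-in for whatever collation ARQ ends up 
    using.
    
    ```java
    import java.text.Collator;
    import java.util.Locale;
    import java.util.Objects;

    // Hypothetical class for illustration; not part of Jena.
    final class ConsistentSortKey implements Comparable<ConsistentSortKey> {
        private final String text;
        private final String collation;

        ConsistentSortKey(String text, String collation) {
            // Assumption: neither value may be null in this sketch.
            this.text = Objects.requireNonNull(text);
            this.collation = Objects.requireNonNull(collation);
        }

        @Override
        public int compareTo(ConsistentSortKey other) {
            // 1. Different collation tags: order by the tag, never look at the text.
            int byTag = collation.compareTo(other.collation);
            if (byTag != 0) {
                return byTag;
            }
            // 2. Same tag: order by that language's collation rules.
            Collator collator = Collator.getInstance(Locale.forLanguageTag(collation));
            int byCollation = collator.compare(text, other.text);
            if (byCollation != 0) {
                return byCollation;
            }
            // 3. Tie-break on the raw string so that 0 is returned only for equal keys.
            return text.compareTo(other.text);
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) {
                return true;
            }
            if (!(obj instanceof ConsistentSortKey)) {
                return false;
            }
            ConsistentSortKey other = (ConsistentSortKey) obj;
            // Both fields take part, matching what compareTo looks at.
            return text.equals(other.text) && collation.equals(other.collation);
        }

        @Override
        public int hashCode() {
            return Objects.hash(text, collation);
        }
    }
    ```
    
    With this shape, `compareTo` returns 0 exactly when `equals` returns true, 
    so hash-based and sorted collections would see the same set of distinct 
    keys.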


> Language-specific collation in ARQ
> ----------------------------------
>
>                 Key: JENA-1313
>                 URL: https://issues.apache.org/jira/browse/JENA-1313
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.2.0
>            Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users 
> mailing list in October 2016, I would like to change ARQ collation of literal 
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the 
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
>  method.
> It currently sorts by lexical value first, then by language tag. Since the 
> collation order needs to be stable across all possible literal values, I 
> think the safest way would be to sort by language tag first, then by lexical 
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different 
> collation rules than the main language? It would be a bit strange if all 
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same 
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in 
> implementing it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
