[GitHub] jena issue #237: JENA-1313: compare using a Collator when both literals are ...

kinow Thu, 04 May 2017 03:34:41 -0700

Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/237
  
    > Doesn't this run into the unstable sort issue that @afs cautioned against?
    
    Not sure. I don't think this approach makes the sort unstable, but I went 
looking for ways it could be, and I think I found one problematic case.
    
    > I think it could be avoided by the following logic: If two 
`NodeValueSortKey`s have different collation languages, sort them by the 
collation languages instead of even looking at the text.
    >
    >This is the (lang, lex) approach discussed earlier, just applied in a 
slightly different context.
    
    Sounds like a plan. Let's wait and see what others think.
    
    Now, on stability...
    
    I tried to find ways the sort could be unstable. For two values A and B 
with the same collation, the result is stable. For two values C and D with 
different collations, or with missing collations, the result is a sort by the 
string literal. The node produced is a `Node_Literal` (the function rewrites 
any node given to it as a `Node_Literal`).
    
    Now here is the interesting part. `#equals(Object)` and `#hashCode()` use 
the node value, i.e. the `Node_Literal` string, to compare values. Using the 
approach suggested by @osma, `NodeValue#compare(NodeValue, NodeValue)` for 
`NodeValueSortKey("Casa", "es")` and `NodeValueSortKey("Casa", "pt")` would 
report that `NodeValueSortKey("Casa", "es")` < `NodeValueSortKey("Casa", "pt")`, 
i.e. since the two values have different collation language tags, we would 
compare "es" and "pt".
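    
    The (collation, text) ordering described above can be sketched roughly as 
follows. This is only an illustration: `SortKey` and `SortKeyComparator` are 
hypothetical stand-ins, not Jena classes, and the sketch assumes both collation 
tags are non-null.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

// Hypothetical stand-in for NodeValueSortKey: a string plus a collation language tag.
final class SortKey {
    final String text;
    final String collation;

    SortKey(String text, String collation) {
        this.text = text;
        this.collation = collation;
    }
}

// Orders keys by collation tag first; the text is only consulted when the tags match.
final class SortKeyComparator implements Comparator<SortKey> {
    @Override
    public int compare(SortKey a, SortKey b) {
        // Different collations: order by the tags and never look at the text.
        int byTag = a.collation.compareTo(b.collation);
        if (byTag != 0) {
            return byTag;
        }
        // Same collation: compare the text with a Collator for that language.
        Collator collator = Collator.getInstance(Locale.forLanguageTag(a.collation));
        return collator.compare(a.text, b.text);
    }
}
```

    With this ordering, ("Casa", "es") sorts before ("Casa", "pt") because 
"es" sorts before "pt", regardless of the text.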
    
    However, `#equals(Object)` and `#hashCode()` would report equality based 
only on the `Node_Literal` node. So 
`NodeValueSortKey("Casa", "es").equals(NodeValueSortKey("Casa", "pt"))` would 
return true.
    
    I believe this could cause problems: the merge sort itself would be stable 
(I think), but using the elements (sorted or not) in a map or set could result 
in weird behaviour.
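    
    Outside of Jena, the map/set concern can be reproduced with a small 
self-contained sketch. `LexOnlyKey` is a hypothetical class (not Jena code) 
whose `equals`/`hashCode` look only at the text while `compareTo` looks at the 
collation tag first, mirroring the mismatch described above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.TreeSet;

// Hypothetical key: equality ignores the collation, ordering does not.
final class LexOnlyKey implements Comparable<LexOnlyKey> {
    final String text;
    final String collation;

    LexOnlyKey(String text, String collation) {
        this.text = text;
        this.collation = collation;
    }

    @Override
    public boolean equals(Object o) {
        // Only the text is compared, as with the Node_Literal-based equals.
        return o instanceof LexOnlyKey && text.equals(((LexOnlyKey) o).text);
    }

    @Override
    public int hashCode() {
        return text.hashCode();
    }

    @Override
    public int compareTo(LexOnlyKey other) {
        // Collation tag first, then the text.
        int byTag = collation.compareTo(other.collation);
        return byTag != 0 ? byTag : text.compareTo(other.text);
    }
}

final class InconsistencyDemo {
    static int hashSetSize(List<LexOnlyKey> keys) {
        return new HashSet<>(keys).size();   // deduplicates via equals/hashCode
    }

    static int treeSetSize(List<LexOnlyKey> keys) {
        return new TreeSet<>(keys).size();   // deduplicates via compareTo
    }

    public static void main(String[] args) {
        List<LexOnlyKey> keys = Arrays.asList(
                new LexOnlyKey("Casa", "es"),
                new LexOnlyKey("Casa", "pt"));
        System.out.println(hashSetSize(keys)); // 1: the keys are "equal"
        System.out.println(treeSetSize(keys)); // 2: the keys compare as different
    }
}
```

    The two collections disagree about how many distinct keys there are, which 
is the kind of surprise that an ordering inconsistent with equals can produce.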
    
    Some code to illustrate the above:
    
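    ```
    // Sort keys: equals() only sees the Node_Literal string, so the collation is ignored.
    NodeValueSortKey nvsk1 = new NodeValueSortKey("Casa", "es");
    NodeValueSortKey nvsk2 = new NodeValueSortKey("Casa", "pt");
    System.out.println(nvsk1.equals(nvsk2));
    // true
    
    // Language-tagged literals: equals() goes through LiteralLabel, which checks the language tag.
    NodeValueLang nvl1 = new NodeValueLang("Casa", "es");
    NodeValueLang nvl2 = new NodeValueLang("Casa", "pt");
    System.out.println(nvl1.equals(nvl2));
    // false
    ```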
    
    For `NodeValueLang`, when a `Node_Literal` is created, it is given a 
`LiteralLabel` object that it wraps. Then, when you call 
`NodeValueLang#equals(Object)`, `NodeValueLang` uses 
`LiteralLabel#equals(Object)` to compare against the other `NodeValueLang`, 
and `LiteralLabel` checks the language tag.
    
    I wonder if we should create a new `Node_Concrete` implementation in 
`org.apache.jena.graph` (`Node_SortKey`?), or if we should modify 
`Node_Literal`... I feel the latter would be less elegant than the former.
    
    By using the current implementation, plus @osma's suggestion of comparing 
the collation language tags, and finally by making sure equals/hashCode agree 
with what our comparison says, I believe we would have a stable sort. 
Thoughts?
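    
    As a rough sketch of what "equals/hashCode agree with the comparison" 
could look like, here is a hypothetical `CollatedKey` class (not a Jena class 
and not the proposed `Node_SortKey`). It assumes non-null fields and uses plain 
`String#compareTo` on the text as a stand-in for a `Collator`, since a 
`Collator` can report 0 for strings that are not equal.

```java
import java.util.Objects;

// Hypothetical sort key whose equals, hashCode and compareTo all use (collation, text).
final class CollatedKey implements Comparable<CollatedKey> {
    final String text;
    final String collation;

    CollatedKey(String text, String collation) {
        this.text = Objects.requireNonNull(text);
        this.collation = Objects.requireNonNull(collation);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CollatedKey)) {
            return false;
        }
        CollatedKey other = (CollatedKey) o;
        // Both the collation tag and the text take part in equality.
        return collation.equals(other.collation) && text.equals(other.text);
    }

    @Override
    public int hashCode() {
        return Objects.hash(collation, text);
    }

    @Override
    public int compareTo(CollatedKey other) {
        // Same fields, same precedence: collation tag first, then text.
        int byTag = collation.compareTo(other.collation);
        return byTag != 0 ? byTag : text.compareTo(other.text);
    }
}
```

    Here `compareTo` returns 0 exactly when `equals` returns true, so hash-based 
and sorted collections treat the same pairs of keys as duplicates.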



