[jira] [Commented] (JENA-1313) Language-specific collation in ARQ

Osma Suominen (JIRA) Mon, 03 Apr 2017 04:35:59 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953327#comment-15953327
 ]


Osma Suominen commented on JENA-1313:
-------------------------------------

[~andy.seaborne] Why map to numbers? In my understanding, collation functions 
typically produce keys that are byte strings. See e.g. the [ICU 
documentation|http://userguide.icu-project.org/collation]:

{quote}
The basic ICU Collation Service is provided by two main categories of APIs:

    String comparison - most commonly used: APIs return result of comparing two 
strings (greater than, equal or less than). This is used as a comparator when 
sorting lists, building tree maps, etc.

    Sort key generation - used when a very large set of strings are 
compared/sorted repeatedly: APIs return a zero-terminated array of bytes per 
string known as a sort key. The keys can be compared directly using strcmp or 
memcmp standard library functions, saving repeated lookup and computation of 
each string's collation properties. For example, database applications use 
index tables of sort keys to index strings quickly. Note, however, that this 
only improves performance for large numbers of strings because sorting via the 
comparison functions is very fast. For more information, see Sortkeys vs 
Comparison.
{quote}
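
To make the distinction concrete, here is a minimal sketch of both routes in Java. It assumes the JDK's own java.text.Collator rather than ICU4J, purely to stay dependency-free, and the class and method names are hypothetical. The sort key comes back as a byte array, not a number, and comparing two keys as unsigned bytes should give the same ordering as calling the collator directly.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class; not part of Jena or ICU.
class SortKeyDemo {

    // Sort key generation: one byte string per input string.
    static byte[] sortKey(String s, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        return collator.getCollationKey(s).toByteArray();
    }

    // Compare two strings through their sort keys only.
    // The keys are most-significant-byte first, so an unsigned
    // lexicographic byte comparison is the memcmp equivalent.
    static int compareByKey(String a, String b, Locale locale) {
        return Arrays.compareUnsigned(sortKey(a, locale), sortKey(b, locale));
    }

    // String comparison: ask the collator directly.
    static int compareDirect(String a, String b, Locale locale) {
        return Collator.getInstance(locale).compare(a, b);
    }

    public static void main(String[] args) {
        Locale en = Locale.ENGLISH;
        // Both routes are expected to agree on the sign of the result.
        System.out.println(Integer.signum(compareByKey("a", "B", en)));
        System.out.println(Integer.signum(compareDirect("a", "B", en)));
        // Plain code point order disagrees, because 'B' precedes 'a'.
        System.out.println(Integer.signum("a".compareTo("B")));
    }
}
```

Arrays.compareUnsigned needs Java 9 or later. With ICU4J the shape would be similar, but that is not shown here.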


> Language-specific collation in ARQ
> ----------------------------------
>
>                 Key: JENA-1313
>                 URL: https://issues.apache.org/jira/browse/JENA-1313
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.2.0
>            Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users 
> mailing list in October 2016, I would like to change ARQ collation of literal 
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the 
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
>  method.
> It currently sorts by lexical value first, then by language tag. Since the 
> collation order needs to be stable across all possible literal values, I 
> think the safest way would be to sort by language tag first, then by lexical 
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different 
> collation rules than the main language? It would be a bit strange if all 
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same 
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in 
> implementing it.
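
The ordering proposed in the issue description above (language tag first, then lexical value under that language's collation rules) could be sketched roughly as follows. This is only an illustration under stated assumptions: LangString is a hypothetical stand-in for a language-tagged literal, not a Jena class, the collator is the JDK's java.text.Collator, and nothing here reflects how NodeUtils.compareLiteralsBySyntax is actually written. It also leaves the subtag question open, since {{@en-US}} literals still sort as a separate group after {{@en}}.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; names are illustrative, not Jena API.
class LangLiteralOrder {

    // Minimal stand-in for a language-tagged literal.
    static final class LangString {
        final String lexical;
        final String lang;

        LangString(String lexical, String lang) {
            this.lexical = lexical;
            // Language tags are case-insensitive, so normalise them.
            this.lang = lang.toLowerCase(Locale.ROOT);
        }
    }

    static int compare(LangString a, LangString b) {
        // 1. Language tag first, so each language forms one block.
        int byLang = a.lang.compareTo(b.lang);
        if (byLang != 0) {
            return byLang;
        }
        // 2. Within one language, use that language's collation rules.
        Collator collator = Collator.getInstance(Locale.forLanguageTag(a.lang));
        int byCollation = collator.compare(a.lexical, b.lexical);
        if (byCollation != 0) {
            return byCollation;
        }
        // 3. Tie-break on code points so strings that collate as equal
        //    still get a deterministic order.
        return a.lexical.compareTo(b.lexical);
    }

    public static void main(String[] args) {
        List<LangString> literals = new ArrayList<>();
        literals.add(new LangString("B", "en"));
        literals.add(new LangString("a", "fi"));
        literals.add(new LangString("a", "en"));
        literals.sort(LangLiteralOrder::compare);
        for (LangString l : literals) {
            System.out.println(l.lexical + "@" + l.lang);
        }
    }
}
```

Creating a Collator per comparison is wasteful. A real implementation would presumably cache one collator per language tag.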



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
