[jira] [Comment Edited] (JENA-1313) Language-specific collation in ARQ

Bruno P. Kinoshita (JIRA) Fri, 31 Mar 2017 07:02:12 -0700

    [ https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949989#comment-15949989 ]


Bruno P. Kinoshita edited comment on JENA-1313 at 3/31/17 2:01 PM:
-------------------------------------------------------------------

I liked Rob's suggestion too, of using a collate extension function. I took the 
books.ttl sample Turtle file from Jena and simplified it to contain a few words 
in Finnish, Portuguese and Spanish:

{noformat}
@prefix dc:        <http://purl.org/dc/elements/1.1/> .

@prefix :          <http://example.org/data/> .

# cases:
# - repeated values
# - nodes missing translation (i.e. missing translations for one or more languages)
# - literals in multiple languages (good for filter(lang()))

# Finnish words in expected ASC order:
# - tsahurin kieli
# - tšekin kieli
# - tulun kieli
# - töyhtöhyyppä

# Brazilian Portuguese words in expected ASC order:
# - cote
# - coté
# - côte
# - côté

# Spanish words in expected ASC order:
# - ano
# - anos
# - año

:en1
    dc:title    "tsahurin kieli"@fi ;
    dc:title    "cote"@pt ;
    dc:title    "ano"@es ;
    .
    
:en2
    dc:title    "tulun kieli"@fi ;
    dc:title    "coté"@pt ;
    dc:title    "anos"@es ;
    .
    
:en3
    dc:title    "töyhtöhyyppä"@fi ;
    dc:title    "côté"@pt ;
    dc:title    "año"@es ;
    .
    
:en4
    dc:title    "tšekin kieli"@fi ;
    dc:title    "côte"@pt ;
    .

:en5
    dc:title    "töyhtöhyyppä"@fi ;
    dc:title    "côté"@pt ;
    .
    
:en6
    dc:title    "tšekin kieli"@fi ;
    dc:title    "ano"@es ;
    .

# ref: http://markmail.org/message/ig4w7wqkxsssgqdt
# ref: https://issues.apache.org/jira/browse/JENA-1313
# ref: http://demo.icu-project.org
{noformat}

Here's the query I am using to test it:

{code}
# see NodeUtils#compareLiteralsBySyntax and StrUtils#strCompare
SELECT * WHERE
{
  ?a ?b ?title
  FILTER(LANG(?title) = "fi")
}
ORDER BY ASC(?title)
{code}

For Portuguese and Spanish the default ordering seems to work fine (I will try 
more example values later). Without the language filter, it does what I think 
Andy suggested and sorts by language first (i.e. it returns the values in es, 
then fi, then pt).
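
To make the Finnish case concrete, here is a small standalone Java sketch (not 
Jena code). It sorts the four Finnish titles once with plain String comparison 
and once with a java.text.Collator for the "fi" locale. It assumes the default 
ordering boils down to a plain string comparison, which is what the 
StrUtils#strCompare reference in the query comment suggests, but I have not 
verified that here. The class name FinnishOrderDemo is made up, and the collated 
result depends on the collation data shipped with the JVM, so no output is 
listed for it.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class FinnishOrderDemo {
    // Unicode escapes keep this source ASCII-only, whatever the compiler encoding:
    // 00f6 = o with diaeresis, 00e4 = a with diaeresis, 0161 = s with caron.
    static final List<String> TITLES = Arrays.asList(
            "tulun kieli",
            "t\u00f6yht\u00f6hyypp\u00e4",
            "t\u0161ekin kieli",
            "tsahurin kieli");

    /** Sorts with String.compareTo, i.e. by UTF-16 code unit values. */
    static List<String> sortByCodeUnit(List<String> in) {
        List<String> out = new ArrayList<>(in);
        out.sort(null); // null comparator means natural ordering
        return out;
    }

    /** Sorts with whatever collator the JVM provides for the language tag. */
    static List<String> sortByCollator(List<String> in, String langTag) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(langTag));
        List<String> out = new ArrayList<>(in);
        out.sort(collator); // Collator implements Comparator<Object>
        return out;
    }

    public static void main(String[] args) {
        System.out.println("code unit: " + sortByCodeUnit(TITLES));
        System.out.println("collator:  " + sortByCollator(TITLES, "fi"));
    }
}
```

With plain comparison the s-with-caron title ends up last, because its code 
point is higher than every other second letter in the list, which is not the 
expected Finnish order from the data file.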

So I think with the filter and the collate function, this query would return the 
values in the correct order for Finnish:

{code}
# should we use a different prefix?
PREFIX collate: <http://jena.hpl.hp.com/ARQ/collate#>

SELECT * WHERE
{
  ?a ?b ?title
  FILTER(LANG(?title) = "fi")
}
# something like this?
ORDER BY ASC(collate:fi(?title))
{code}

The behaviour with multiple languages, or with no language tag at all, would 
still need to be defined, though. I guess the default could stay as it is now. 
For multiple languages, we could keep sorting per language, but use 
collate:$LANG where it exists.
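
A rough Java sketch of that ordering: compare by language tag first, then by 
lexical form using a java.text.Collator for that tag, and keep plain string 
comparison for literals without a language tag. It uses a made-up Lit holder 
instead of Jena's Node so that it stays self-contained. None of these names 
exist in ARQ, and wiring this into NodeUtils is not shown. One assumption to 
note: Collator.getInstance falls back to the JVM's default rules for a language 
it has no data for, so "where it exists" is only approximated here.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class LangFirstOrder {
    /** Hypothetical stand-in for a literal: lexical form plus language tag ("" if none). */
    static final class Lit {
        final String lexical;
        final String lang;
        Lit(String lexical, String lang) { this.lexical = lexical; this.lang = lang; }
        @Override public String toString() {
            return lang.isEmpty() ? lexical : lexical + "@" + lang;
        }
    }

    // One collator per language tag. Plain HashMap: this sketch is single-threaded.
    private static final Map<String, Collator> COLLATORS = new HashMap<>();

    static Collator collatorFor(String lang) {
        return COLLATORS.computeIfAbsent(lang,
                tag -> Collator.getInstance(Locale.forLanguageTag(tag)));
    }

    /** Language tag first (case-insensitively), then lexical form. */
    static final Comparator<Lit> BY_LANG_THEN_COLLATION = (a, b) -> {
        int byLang = a.lang.compareToIgnoreCase(b.lang);
        if (byLang != 0) return byLang;
        // No language tag: keep plain string comparison ("as-is").
        if (a.lang.isEmpty()) return a.lexical.compareTo(b.lexical);
        int byCollation = collatorFor(a.lang).compare(a.lexical, b.lexical);
        // Tie-break so strings the collator treats as equal still order stably.
        return byCollation != 0 ? byCollation : a.lexical.compareTo(b.lexical);
    };

    static List<Lit> sorted(List<Lit> in) {
        List<Lit> out = new ArrayList<>(in);
        out.sort(BY_LANG_THEN_COLLATION);
        return out;
    }

    public static void main(String[] args) {
        // 00f1 = n with tilde, written as a Unicode escape to keep the source ASCII-only.
        List<Lit> titles = Arrays.asList(
                new Lit("cote", "pt"), new Lit("a\u00f1o", "es"),
                new Lit("tulun kieli", "fi"), new Lit("ano", "es"));
        System.out.println(sorted(titles));
    }
}
```

Literals without a language tag sort before all tagged ones in this sketch, 
simply because the empty string compares lowest. That is a side effect of the 
sketch, not a proposal.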


> Language-specific collation in ARQ
> ----------------------------------
>
>                 Key: JENA-1313
>                 URL: https://issues.apache.org/jira/browse/JENA-1313
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.2.0
>            Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users 
> mailing list in October 2016, I would like to change ARQ collation of literal 
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the 
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
>  method.
> It currently sorts by lexical value first, then by language tag. Since the 
> collation order needs to be stable across all possible literal values, I 
> think the safest way would be to sort by language tag first, then by lexical 
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different 
> collation rules than the main language? It would be a bit strange if all 
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same 
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in 
> implementing it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
