[
https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949989#comment-15949989
]
Bruno P. Kinoshita edited comment on JENA-1313 at 3/31/17 2:01 PM:
-------------------------------------------------------------------
I liked Rob's suggestion of using a collate extension function too. I took the
books.ttl sample Turtle file from Jena and simplified it to contain a few words
in Finnish, Portuguese and Spanish:
{noformat}
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix : <http://example.org/data/> .
# cases:
# - repeated values
# - nodes missing translation (i.e. missing translations for one or more languages)
# - literals in multiple languages (good for filter(lang()))
# Finnish words in expected ASC order:
# - tsahurin kieli
# - tšekin kieli
# - tulun kieli
# - töyhtöhyyppä
# Brazilian Portuguese words in expected ASC order:
# - cote
# - coté
# - côte
# - côté
# Spanish words in expected ASC order:
# - ano
# - anos
# - año
:en1
dc:title "tsahurin kieli"@fi ;
dc:title "cote"@pt ;
dc:title "ano"@es ;
.
:en2
dc:title "tulun kieli"@fi ;
dc:title "coté"@pt ;
dc:title "anos"@es ;
.
:en3
dc:title "töyhtöhyyppä"@fi ;
dc:title "côté"@pt ;
dc:title "año"@es ;
.
:en4
dc:title "tšekin kieli"@fi ;
dc:title "côte"@pt ;
.
:en5
dc:title "töyhtöhyyppä"@fi ;
dc:title "côté"@pt ;
.
:en6
dc:title "tšekin kieli"@fi ;
dc:title "ano"@es ;
.
# ref: http://markmail.org/message/ig4w7wqkxsssgqdt
# ref: https://issues.apache.org/jira/browse/JENA-1313
# ref: http://demo.icu-project.org
{noformat}
Here's the query I am using to test it:
{code}
# see NodeUtils#compareLiteralsBySyntax and StrUtils#strCompare
SELECT * WHERE
{
?a ?b ?title
FILTER(LANG(?title) = "fi")
}
ORDER BY ASC(?title)
{code}
For Portuguese and Spanish it seems to work fine (I will play with more example
values later). When you don't filter by language, it does what I think Andy
suggested and sorts by language first (i.e. it returns the values in es, then
fi, then pt).
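To illustrate why the plain ORDER BY does not give the expected Finnish order, here is a minimal Java sketch. It assumes the current comparison boils down to String.compareTo (as the StrUtils#strCompare reference in the query suggests) and contrasts it with java.text.Collator for the "fi" locale. The class and method names are made up for this example and are not part of Jena, and the collated result depends on the collation data shipped with the JDK.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Jena.
final class FinnishCollationSketch {

    // Plain String ordering compares UTF-16 code units, so U+00F6 (o with
    // diaeresis) and U+0161 (s with caron) sort after every ASCII letter.
    static List<String> sortByCodeUnit(List<String> titles) {
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort(String::compareTo);
        return sorted;
    }

    // Locale-aware ordering using the collation rules the JDK has for the locale.
    static List<String> sortWithCollator(List<String> titles, Locale locale) {
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort(Collator.getInstance(locale));
        return sorted;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source independent of the file encoding.
        List<String> titles = Arrays.asList(
                "tsahurin kieli",
                "tulun kieli",
                "t\u00f6yht\u00f6hyypp\u00e4",   // toyhtohyyppa with diaereses
                "t\u0161ekin kieli");            // tsekin kieli with a caron on the s
        System.out.println(sortByCodeUnit(titles));
        System.out.println(sortWithCollator(titles, Locale.forLanguageTag("fi")));
    }
}
```

With code-unit ordering, the title starting with "tš" ends up last. With Finnish collation rules available, the second list should follow the expected order given in the comments of the sample data.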
So I think that, with the filter and the collate function, this query would
return the Finnish values in the correct order:
{code}
# should we use a different prefix?
PREFIX collate: <http://jena.hpl.hp.com/ARQ/collate#>
SELECT * WHERE
{
?a ?b ?title
FILTER(LANG(?title) = "fi")
}
# something like this?
ORDER BY ASC(collate:fi(?title))
{code}
The behaviour when you have multiple languages, or no language at all, would
still need to be defined, though. I guess the default could stay as it is now.
For multiple languages, we could keep sorting per language, but use
collate:$LANG if it exists.
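One possible reading of that behaviour, sketched in plain Java rather than against the ARQ API: sort by language tag first, then by the collation rules for that tag, and fall back to plain string comparison when there is no tag. LangLiteral and byLangThenCollation are hypothetical names for this example only. The sketch also assumes language tags are already normalised to lower case, and it leaves open how subtags such as pt-BR should be grouped.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not part of Jena.
final class LangAwareOrderSketch {

    // Stand-in for an RDF literal: lexical form plus language tag ("" when absent).
    static final class LangLiteral {
        final String lexical;
        final String lang;

        LangLiteral(String lexical, String lang) {
            this.lexical = lexical;
            this.lang = lang;
        }

        @Override
        public String toString() {
            return lang.isEmpty() ? lexical : lexical + "@" + lang;
        }
    }

    // Language tag first, then locale-specific collation within each language.
    static Comparator<LangLiteral> byLangThenCollation() {
        Map<String, Collator> collators = new HashMap<>(); // one Collator per tag
        return (a, b) -> {
            int byLang = a.lang.compareTo(b.lang);
            if (byLang != 0) {
                return byLang;
            }
            if (a.lang.isEmpty()) {
                // No language tag: keep the plain string comparison.
                return a.lexical.compareTo(b.lexical);
            }
            Collator collator = collators.computeIfAbsent(
                    a.lang, tag -> Collator.getInstance(Locale.forLanguageTag(tag)));
            int byCollation = collator.compare(a.lexical, b.lexical);
            // Tie-break on the raw string so the overall order stays total.
            return byCollation != 0 ? byCollation : a.lexical.compareTo(b.lexical);
        };
    }

    static List<LangLiteral> sort(List<LangLiteral> literals) {
        List<LangLiteral> sorted = new ArrayList<>(literals);
        sorted.sort(byLangThenCollation());
        return sorted;
    }
}
```

Sorting a mix of untagged, es, fi and pt literals this way keeps the es, fi, pt grouping described above and only changes the order inside each language.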
> Language-specific collation in ARQ
> ----------------------------------
>
> Key: JENA-1313
> URL: https://issues.apache.org/jira/browse/JENA-1313
> Project: Apache Jena
> Issue Type: Improvement
> Components: ARQ
> Affects Versions: Jena 3.2.0
> Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users
> mailing list in October 2016, I would like to change ARQ collation of literal
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
> method.
> It currently sorts by lexical value first, then by language tag. Since the
> collation order needs to be stable across all possible literal values, I
> think the safest way would be to sort by language tag first, then by lexical
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different
> collation rules than the main language? It would be a bit strange if all
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in
> implementing it.