[ 
https://issues.apache.org/jira/browse/COMMONSRDF-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15818566#comment-15818566
 ] 

Stian Soiland-Reyes edited comment on COMMONSRDF-51 at 1/11/17 4:24 PM:
------------------------------------------------------------------------

I think this needs to be clarified on public-rdf-comme...@w3.org as our 
"character by character" is a [quote from the 
spec|https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal]:

{quote}

Literal term equality: Two literals are term-equal (the same RDF literal) if 
and only if the two lexical forms, the two datatype IRIs, and the two language 
tags (if any) compare equal, character by character. Thus, two literals can 
have the same value without being the same RDF term. For example:

      "1"^^xs:integer
      "01"^^xs:integer
    
denote the same value, but are not the same literal RDF terms and are not 
term-equal because their lexical form differs.
{quote}

It also says above that the value space is always in lower case, but then says 
equality is done "character by character" -- not by value space.  (As that 
example shows, the lexical forms of datatypes like integers are also compared 
character by character rather than by value space.)
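
To make the distinction concrete, here is a minimal sketch of term equality versus value equality. The class `SketchLiteral` and its methods are hypothetical names made up for illustration; this is not the Commons RDF `Literal` API, just the quoted rule written out literally.

```java
import java.math.BigInteger;
import java.util.Objects;

// Hypothetical minimal literal for illustration; not the Commons RDF Literal API.
final class SketchLiteral {
    static final String XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer";

    final String lexicalForm;
    final String datatypeIri;
    final String languageTag; // null when the literal has no language tag

    SketchLiteral(String lexicalForm, String datatypeIri, String languageTag) {
        this.lexicalForm = lexicalForm;
        this.datatypeIri = datatypeIri;
        this.languageTag = languageTag;
    }

    // Term equality as worded in the quote: every component is compared
    // character by character, with no interpretation of the value.
    boolean termEquals(SketchLiteral other) {
        return lexicalForm.equals(other.lexicalForm)
                && datatypeIri.equals(other.datatypeIri)
                && Objects.equals(languageTag, other.languageTag);
    }

    // Value comparison, shown only for contrast and only for integer literals.
    static boolean sameIntegerValue(SketchLiteral a, SketchLiteral b) {
        return new BigInteger(a.lexicalForm).equals(new BigInteger(b.lexicalForm));
    }
}
```

With this sketch, `"1"` and `"01"` typed as integers have the same value but are not term-equal, and under a strictly character-by-character reading `en-GB` and `en-gb` would not be term-equal either, which is the point that needs clarifying.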

I have nevertheless started a branch 
[COMMONSRDF-51-langtag-lcase|https://github.com/apache/commons-rdf/compare/COMMONSRDF-51-langtag-lcase]
 to try this out. This revealed bugs in the bindings for simple (just the 
Turkish case), jsonld-java (which does no validation of language tags), rdf4j 
(fails Turkish test) and jena (fails Turkish test).

As both RDF4J and Jena are vulnerable to the Turkish case, that should be 
reported upstream after rdf-comments clarifications.
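
For anyone unfamiliar with the Turkish case, a small sketch of the underlying JDK behaviour follows. The class and method names are hypothetical, and this is not the actual test from the branch; it only shows why lower-casing with the platform default locale is unsafe for language tags.

```java
import java.util.Locale;

final class TurkishCaseSketch {
    // Lower-cases with Turkish casing rules, which is what an argument-less
    // toLowerCase() does on a JVM whose default locale is Turkish.
    static String lowerTurkish(String tag) {
        return tag.toLowerCase(Locale.forLanguageTag("tr"));
    }

    // Lower-cases with a fixed locale, as the issue description recommends.
    static String lowerEnglish(String tag) {
        return tag.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        System.out.println(lowerEnglish("FI").equals("fi")); // true
        System.out.println(lowerTurkish("FI").equals("fi")); // false: "I" becomes dotless "\u0131"
    }
}
```

So a tag such as `FI` lower-cased on a Turkish-locale JVM no longer matches `fi`, even though both consist only of US-ASCII letters before the conversion.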

Would it make sense for Commons RDF to strengthen getLanguageTag() to ALWAYS 
return the language tag in lower case for any RDF implementation (i.e. 
normalize if the implementation does not do so correctly internally) - as a kind of 
interoperability/RDF 1.1 measure - or should we strive to keep their current 
case representation as-is? 
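
As a rough sketch of what the first option could look like, assuming a wrapper approach: `TaggedLiteral` and `LowerCaseTagLiteral` below are hypothetical names standing in for the relevant part of a literal API, not existing Commons RDF classes, and the choice of `Locale.ENGLISH` simply follows the suggestion in the issue description.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical stand-in for the part of a literal API discussed here.
interface TaggedLiteral {
    Optional<String> getLanguageTag();
}

// Wrapper that normalizes whatever case the underlying implementation returns.
final class LowerCaseTagLiteral implements TaggedLiteral {
    private final TaggedLiteral delegate;

    LowerCaseTagLiteral(TaggedLiteral delegate) {
        this.delegate = delegate;
    }

    @Override
    public Optional<String> getLanguageTag() {
        // A fixed locale avoids the Turkish case; an absent tag stays absent.
        return delegate.getLanguageTag().map(tag -> tag.toLowerCase(Locale.ENGLISH));
    }
}
```

The trade-off is that callers would no longer see the case the underlying implementation stored, which is exactly the question above.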


> RDF-1.1 specifies that language tags need to be compared using lower-case
> -------------------------------------------------------------------------
>
>                 Key: COMMONSRDF-51
>                 URL: https://issues.apache.org/jira/browse/COMMONSRDF-51
>             Project: Apache Commons RDF
>          Issue Type: Bug
>          Components: api
>    Affects Versions: 0.3.0
>            Reporter: Peter Ansell
>            Assignee: Stian Soiland-Reyes
>
> The RDF-1.1 specification states that the [value space of Literal language 
> tags is 
> lowercase|https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal], which 
> does not conflict with the case-insensitive specification in BCP47. The 
> Literal.equals and Literal.hashCode API contracts should specify that 
> language tags must be compared using lowercase, even if they are otherwise 
> stored and returned as upper-case by getLanguageTag. The API currently has 
> incorrect language by saying "character-by-character" for language tag 
> comparisons, as that implies case-sensitive comparisons are used.
> The lowercasing must also be done using a locale that is consistent (a known 
> example where lowercase and uppercase do not roundtrip as expected for 
> US-ASCII characters is Turkish [1]), so I would recommend actually stating 
> that .toLowerCase(Locale.ENGLISH) is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
