[jira] [Updated] (JENA-457) ntriples: Object-URIs should be %-encoded

Pascal Christoph (JIRA) Fri, 17 May 2013 02:57:26 -0700

     [ https://issues.apache.org/jira/browse/JENA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Pascal Christoph updated JENA-457:
----------------------------------

    Description: 
N-Triples serialization is pure ASCII for now [1], so IRIs (see RFC 3987) 
cannot be written directly because UTF-8 is not allowed. Serializing a Model to 
N-Triples escapes non-ASCII characters with '\u' escaping. In most cases the 
resulting URIs do not resolve as they stand, e.g. at DBpedia. Three different 
notations are possible:

1. http://de.dbpedia.org/resource/T\u00FCr
2. http://de.dbpedia.org/resource/T%fcr
3. http://de.dbpedia.org/resource/Tür
[EDIT: if notation 3 renders incorrectly, the intended character is U+00FC; see 
http://www.fileformat.info/info/unicode/char/00fc for more info]
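
To make the three notations concrete, here is a small sketch in plain Java (JDK 
only) that derives notations 1 and 2 from notation 3. The class and method names 
are made up for illustration and are not part of Jena. Note that the 
percent-encoded form depends on the charset: %fc is the ISO-8859-1 octet for 
U+00FC, whereas the IRI-to-URI mapping in RFC 3987 percent-encodes the UTF-8 
octets, which gives %C3%BC.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Jena.
public class IriNotations {

    // Notation 1: replace every non-ASCII char with a backslash-u escape.
    // Sketch only: it works per UTF-16 code unit, so it is correct for BMP
    // characters but not for supplementary ones.
    static String unicodeEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    // Notation 2: percent-encode the octets of every non-ASCII character
    // in the given charset; ASCII characters are copied unchanged.
    static String percentEncode(String s, Charset charset) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int len = Character.charCount(cp);
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                for (byte b : s.substring(i, i + len).getBytes(charset)) {
                    sb.append(String.format("%%%02X", b & 0xFF));
                }
            }
            i += len;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String iri = "http://de.dbpedia.org/resource/T\u00FCr"; // notation 3
        System.out.println(unicodeEscape(iri));
        System.out.println(percentEncode(iri, StandardCharsets.ISO_8859_1));
        System.out.println(percentEncode(iri, StandardCharsets.UTF_8));
    }
}
```

Which charset a given server expects in percent-encoded paths is a property of 
that server, so the UTF-8 variant is an assumption to check against the target.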

While notation 1 does not resolve and notation 3 is not ASCII, notation 2 (the 
percent-encoding of octets) meets both requirements. I would therefore like 
notation 2 to be used to encode object URIs in ASCII N-Triples serialization. 
See also 
https://answers.semanticweb.com/questions/18508/best-way-to-encode-uri-refsiris-for-n-triples

One could use Jena to serialize as Turtle and convert that Turtle file to 
N-Triples with rapper. But rapper converts the Unicode escape sequences in 
literals to UTF-8 and leaves the URIs untouched (wisely, since they are 
identifiers). So this does not help.

The serialization in question is obtained with code like this:

 RDFWriter fasterWriter = model.getWriter("N-TRIPLE");

It should be safe to apply a patch like this in NTripleWriter.java:

private static void writeURIString(String s, PrintWriter writer) {
    writer.print(org.apache.commons.httpclient.util.URIUtil.encodeQuery(s));
}
(not tested)
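
For comparison, a dependency-free sketch of the same idea using only the JDK. 
It assumes UTF-8 octets and percent-encodes only non-ASCII characters; the 
class name and the encode helper are hypothetical, and this is not Jena's 
actual NTripleWriter code. If I recall the commons-httpclient 3.x API 
correctly, encodeQuery declares a checked URIException, so the one-line patch 
would also need to handle or declare that exception.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the actual Jena NTripleWriter.
public class PercentEncodingWriterSketch {

    // Writes s with every non-ASCII character replaced by the
    // percent-encoded form of its UTF-8 octets. ASCII bytes, including
    // any '%' escapes already present, are written unchanged.
    static void writeURIString(String s, PrintWriter writer) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            if (octet < 0x80) {
                writer.print((char) octet);
            } else {
                writer.print(String.format("%%%02X", octet));
            }
        }
    }

    // Convenience wrapper that returns the encoded string.
    static String encode(String s) {
        StringWriter out = new StringWriter();
        PrintWriter writer = new PrintWriter(out);
        writeURIString(s, writer);
        writer.flush();
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("http://de.dbpedia.org/resource/T\u00FCr"));
    }
}
```

This sketch does not touch ASCII characters that are illegal in URIs (such as 
spaces or angle brackets); a real patch would have to decide how to treat them.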

What do you think?
-o

[1] See a month-old W3C note which proposes using UTF-8 instead of ASCII: 
http://www.w3.org/TR/2013/NOTE-n-triples-20130409/#n-triple-changes

    
> ntriples: Object-URIs should be %-encoded
> -----------------------------------------
>
>                 Key: JENA-457
>                 URL: https://issues.apache.org/jira/browse/JENA-457
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ, Jena, RDF API
>    Affects Versions: ARQ 2.9.3
>         Environment: everywhere
>            Reporter: Pascal Christoph
>            Priority: Minor
>              Labels: patch
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
