Re: [Dbpedia-discussion] Encoding error in german labels file

Jörn Hees Tue, 20 Sep 2011 07:01:52 -0700

Hi Daniel,

sorry for the late reply.
I think your question is quite important as most issues with RDF seem to be 
encoding related.


On 12. Sep. 2011, at 18:48, Gerber Daniel wrote:
> But still, why are the URIs not encoded this way?

short answer: because that would have been too easy and it's not how it 
developed over time :)


longer answer: 

RDF came along quite some time after URIs and initially was tied to its XML 
serialization.
Now, XML already had its own way to escape non-ASCII values (numeric character 
references such as &#x14D;), and N-Triples got the \uxxxx and \UXXXXXXXX 
escapes.
This kind of escaping for literals also made it into several other formats such 
as N3 and Turtle.
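
To make the literal escaping concrete, here is a minimal sketch in Python. The 
function name ntriples_escape is made up for illustration, and the rule "escape 
everything outside printable ASCII" is an assumption that matches the 
Nink\u014D example in this thread, not a full implementation of any spec.

```python
def ntriples_escape(text: str) -> str:
    """Escape a Unicode string for use inside an N-Triples style literal.

    Hypothetical helper: printable ASCII is kept, everything else becomes
    \\uXXXX (up to U+FFFF) or \\UXXXXXXXX (above U+FFFF).
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\r":
            out.append("\\r")
        elif ch == "\t":
            out.append("\\t")
        elif 0x20 <= cp <= 0x7E:
            # Printable ASCII passes through unchanged.
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)
        else:
            out.append("\\U%08X" % cp)
    return "".join(out)


print(ntriples_escape("Ninkō"))  # Nink\u014D
```

Note that this only concerns the literal; the IRI in the subject position is a 
different story, see below.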

Now to the "URIs"...
>> <http://dbpedia.org/resource/Emperor_Ninkō>  
>> <http://www.w3.org/2000/01/rdf-schema#label>  "Nink\u014D"@de .


Your terminology here is not very precise, probably because we all tend to be 
lazy and just call everything that looks like a URI a URI :). Precise terms 
help to explain this:
The <http://dbpedia.org/resource/Emperor_Ninkō> is an "IRI Reference" (it even 
is an IRI as it's not relative). They were formerly called "RDF URI reference" 
(not to be confused with "URI reference" from the URI RFC) in anticipation of 
the IRI RFC:
http://www.w3.org/TR/rdf-concepts/#dfn-URI-reference
> A URI reference within an RDF graph (an RDF URI reference) is a Unicode 
> string [UNICODE] that:
> 
>       • does not contain any control characters ( #x00 - #x1F, #x7F-#x9F)
>       • and would produce a valid URI character sequence (per RFC2396 [URI], 
> sections 2.1) representing an absolute URI with optional fragment identifier 
> when subjected to the encoding described below.
> The encoding consists of:
> 
>       • encoding the Unicode string as UTF-8 [RFC-2279], giving a sequence of 
> octet values.
>       • %-escaping octets that do not correspond to permitted US-ASCII 
> characters.
> [...]
> Note: this section anticipates an RFC on Internationalized Resource 
> Identifiers. Implementations may issue warnings concerning the use of RDF URI 
> References that do not conform with [IRI draft] or its successors.

Now what does this mean? 
The <http://dbpedia.org/resource/Emperor_Ninkō> is a Unicode string (!= UTF-8 
String), which can be turned into a valid "URI character sequence" by following 
the steps described above.
In order to dereference such an IRI we need to transform it into its URI 
equivalent and then use HTTP. In the words of the IRI RFC, sec. 1.2.a 
(http://tools.ietf.org/html/rfc3987):
> "On the other hand, in the HTTP protocol [RFC2616], the Request URI is 
> defined as a URI, which means that direct use of IRIs is not allowed in HTTP 
> requests."

This means that while it is allowed to identify things in RDF with IRIs, it 
isn't possible to look them up without first encoding them as a %-escaped 
UTF-8 string, which then is an (ASCII) URI.
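
The two encoding steps (UTF-8, then %-escaping) can be sketched in a few lines 
of Python. The name iri_to_uri and the URI_SAFE constant are made up for 
illustration; urllib.parse.quote is the standard library call. This is a 
simplification: it assumes the host name is already ASCII and does not handle 
internationalized domain names.

```python
from urllib.parse import quote

# Characters that are already legal in a URI and are left untouched:
# the RFC 3986 reserved set plus "%" (so existing escapes are not doubled).
# quote() never escapes ASCII letters, digits, or "_.-~" anyway.
URI_SAFE = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Encode the IRI as UTF-8 and %-escape every octet that is not allowed."""
    return quote(iri, safe=URI_SAFE, encoding="utf-8")


print(iri_to_uri("http://dbpedia.org/resource/Emperor_Ninkō"))
# http://dbpedia.org/resource/Emperor_Nink%C5%8D
```

The ō (U+014D) becomes the two UTF-8 octets C5 8D, each of which is then 
%-escaped.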

Now, you might remember that you can just copy the 
<http://dbpedia.org/resource/Emperor_Ninkō> into your browser and get results. 
Correct, but that's because most browsers do the IRI -> URI magic under the 
hood so you don't see that they actually request 
http://dbpedia.org/page/Emperor_Nink%C5%8D . (In Firefox hit CTRL + i (win) or 
CMD + i (mac)).

Cheers,
Jörn

