Re: [Dbpedia-discussion] DBpedia datasets and their encodings

Richard Cyganiak Fri, 03 Apr 2009 07:25:48 -0700

Nuno,

On 2 Apr 2009, at 12:44, Nuno Cardoso wrote:
> Can you please explain to me the encoding procedures that DBpedia uses
> for the datasets?


We use UTF-8 everywhere.

> It seems that everything was encoded from MacRoman, which is not a
> good thing for people working on non-Mac machines (I develop on a
> Mac, but the production environment will be Linux, which does not
> have the slightest idea what the MacRoman encoding is).

There is no MacRoman in DBpedia. Your text editor saves your source  
files in UTF-8, but your compiler/interpreter interprets them in  
MacRoman. That gives rise to the funny effects you are seeing.
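
To make that concrete, here is a minimal sketch of the effect in Python (my choice for illustration; your script is Groovy). The variable names are made up, and it assumes the mismatch is exactly "bytes written as UTF-8, read back as MacRoman":

```python
# Sketch: UTF-8 bytes misread as MacRoman (hypothetical variable names).
original = "José Saramago"

# What an editor writes to disk when it saves the text as UTF-8.
utf8_bytes = original.encode("utf-8")

# What a tool sees if it wrongly assumes those bytes are MacRoman.
garbled = utf8_bytes.decode("mac_roman")
print(garbled)  # the two-byte "é" shows up as two unrelated characters

# The damage is reversible as long as the wrongly assumed charset is known.
repaired = garbled.encode("mac_roman").decode("utf-8")
print(repaired == original)
```

Nothing in the data changes here; only the interpretation of the same bytes does.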

<snip>
> So, I must force MacRoman to properly encode entities to DBpedia...
> SELECT * WHERE {<http://dbpedia.org/resource/Jos%C3%A9_Saramago> ?p ?o}

That's UTF-8, not MacRoman. MacRoman would be Jos%8E_Saramago.  
MacRoman is a single-byte encoding, thus the fact that the single  
character 'é' has been encoded as two octets '%C3%A9' should already  
tell you that you're not looking at MacRoman. The fact that US-ASCII  
characters remain unencoded while other characters are multi-byte  
encoded is a very strong clue that you're looking at UTF-8.
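
If you want to check this yourself, a small Python sketch along these lines percent-encodes the same name under three charsets and counts the octets used for 'é'. The `results` dictionary is just an illustrative name:

```python
from urllib.parse import quote

name = "José_Saramago"

# Percent-encode the same name under three charsets and record how many
# octets the single character "é" occupies in each.
results = {}
for charset in ("utf-8", "mac_roman", "iso-8859-1"):
    octets = len("é".encode(charset))
    results[charset] = (octets, quote(name, encoding=charset))

for charset, (octets, encoded) in results.items():
    print(f"{charset:<10} octets for é: {octets}  ->  {encoded}")
```

Only the UTF-8 row uses two octets for 'é', which matches the %C3%A9 in the DBpedia URI.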

> I'm sure that you've thought about it, but why MacRoman? Can't the
> datasets be in UTF-8? Encoding is essential for GET parameters, but
> unencoded URLs work perfectly well in Wikipedia, for instance. I mean,
> the en.wikipedia.org/wiki/José_Saramago
> <http://en.wikipedia.org/wiki/Jos%C3%A9_Saramago> page works fine.

I hope you realize how ironic your comments are!

We use exactly the same encoding as Wikipedia: 
http://en.wikipedia.org/wiki/Jos%C3%A9_Saramago

There is no such thing as an unencoded URI. You simply cannot have a
character like é in a URI. When you enter
http://en.wikipedia.org/wiki/José_Saramago
into your browser, then your browser automatically encodes the URI
(using UTF-8 followed by percent-encoding) before sending it to the
server. That's why it *appears* like unencoded URIs work in your
browser. In reality, they wouldn't even be valid URIs.
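
The two steps can be sketched roughly as follows. This is a simplification written for illustration, not what any particular browser runs: the helper name `percent_encode_non_ascii` is made up, and it only handles non-ASCII characters (no spaces or reserved characters).

```python
from urllib.parse import unquote

typed = "http://en.wikipedia.org/wiki/José_Saramago"

def percent_encode_non_ascii(text: str) -> str:
    """Hypothetical helper: UTF-8 encode each non-ASCII character,
    then write every resulting byte as %XX. ASCII passes through."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)

sent = percent_encode_non_ascii(typed)
print(sent)

# Decoding the percent-escapes as UTF-8 gives back what was typed.
print(unquote(sent) == typed)
```

The string in `sent` is the same URI that Wikipedia and DBpedia use.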

> And the properties are even more difficult to parse. For instance, the
> word "population" in Portuguese is população.
> The property is represented in the datasets
> as popula_percent_E3_percent_A7_percent_E3_percent_A3o.
> That's most likely a replaceAll("%","_percent_"), but why is it
> needed?

It is needed because RDF property names that contain the percent
character cannot be serialised in RDF/XML. RDF/XML writes each
property as an XML element name, and XML names cannot contain '%'.
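
As a rough illustration of why the replacement helps, here is a Python sketch. It is not DBpedia's extraction code: the function names are hypothetical, the input is the plain UTF-8 percent-encoding of "população" (my assumption, not a string taken from the dump), and the name check is a simplified ASCII-only approximation of the XML name rules.

```python
import re
from urllib.parse import quote, unquote

# Simplified, ASCII-only approximation of an XML NCName (assumption:
# good enough to show that "%" is rejected and "_" is accepted).
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*\Z")

def to_property_name(label: str) -> str:
    """Hypothetical: percent-encode as UTF-8, then replace each '%'."""
    return quote(label, safe="").replace("%", "_percent_")

def from_property_name(local_name: str) -> str:
    """Hypothetical inverse; ambiguous if a label itself contains
    the text '_percent_'."""
    return unquote(local_name.replace("_percent_", "%"))

encoded = quote("população", safe="")
escaped = to_property_name("população")

print(encoded, bool(NCNAME_ASCII.match(encoded)))   # contains '%'
print(escaped, bool(NCNAME_ASCII.match(escaped)))   # usable as XML name
print(from_property_name(escaped))
```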

In general, to get from URIs to human-readable strings, don't mess
around with the URI, but find the rdfs:label property of the resource.
This advice holds for most RDF data. In DBpedia, it works for all
instances (in the /resource/ namespace), and it works for all
properties from the English-language infobox dataset. It does
currently *not* work for properties in other languages, because their
labels are not loaded into the DBpedia RDF store. But they are
available for download. The Portuguese one is here:
http://downloads.dbpedia.org/3.2/pt/infoboxproperties_pt.nt.bz2

Unfortunately, the labels in the dump are badly broken:
<http://dbpedia.org/property/popula_percent_E3_percent_A7_percent_E3_percent_A3o> <http://www.w3.org/2000/01/rdf-schema#label> "popula_percent_ e3_percent_ a7_percent_ e3_percent_ a3o" .

The literal should read "população". It's a bug.

(It would probably be a good idea for the DBpedia admins to load those  
dumps into the store after the bug has been fixed.)
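
For the "use rdfs:label" advice, a minimal Python sketch of pulling labels out of N-Triples lines might look like this. The function name, the regular expression, and the sample triples (which use example.org, not real DBpedia data) are all made up for illustration; the pattern is deliberately simple and leaves any escape sequences inside the literal untouched.

```python
import re

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Simplified pattern: subject URI, the rdfs:label predicate, then a plain
# or language-tagged literal. Not a full N-Triples parser.
LABEL_LINE = re.compile(
    r'^<([^>]+)>\s+<' + re.escape(RDFS_LABEL)
    + r'>\s+"(.*)"(?:@[A-Za-z0-9-]+)?\s*\.\s*$'
)

def read_labels(lines):
    """Hypothetical helper: map each subject URI to its label."""
    labels = {}
    for line in lines:
        match = LABEL_LINE.match(line)
        if match:
            labels[match.group(1)] = match.group(2)
    return labels

# Made-up sample lines, for illustration only.
sample = [
    '<http://example.org/property/name> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "name" .',
    '<http://example.org/property/name> '
    '<http://example.org/other> "ignored" .',
]
print(read_labels(sample))
```

For a downloaded dump, the lines could come from `bz2.open(path, "rt", encoding="utf-8")`, where `path` is wherever you saved the file.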

> Do you have any plans for future dataset releases to make this
> "charset hell" a little less painful?

We already use UTF-8 everywhere. We cannot fix "charset hell" with new  
dataset releases. "Charset hell" exists because most developers don't  
care about Unicode or character encodings, even though they should.  
Internationalization is hard, unfortunately.

Best,
Richard



>
>
> Cheers,
>
> Nuno Cardoso
>
> === SCRIPT ===
> import java.net.*
>
> def x = 'José Saramago'        // plain text, to be encoded
> def a = 'Jos%C3%A9_Saramago'   // percent-encoded text, to be decoded
> def hr = "------"
>
> def encoding = 'ISO-8859-1'
> println "Encoding $x in $encoding: "+java.net.URLEncoder.encode(x, encoding)
> println "Decoding $a in $encoding: "+java.net.URLDecoder.decode(a, encoding)
> println hr
>
> encoding = 'UTF-8'
> println "Encoding $x in $encoding: "+java.net.URLEncoder.encode(x, encoding)
> println "Decoding $a in $encoding: "+java.net.URLDecoder.decode(a, encoding)
> println hr
>
> encoding = 'MacRoman'
> println "Encoding $x in $encoding: "+java.net.URLEncoder.encode(x, encoding)
> println "Decoding $a in $encoding: "+java.net.URLDecoder.decode(a, encoding)
> println hr
>
> // The single-argument forms use the platform default encoding
> println "default encoding: "+System.getProperty("file.encoding")
> println "Encoding $x in default: "+java.net.URLEncoder.encode(x)
> println "Decoding $a in default: "+java.net.URLDecoder.decode(a)
>
> =========
> Nuno Cardoso, PhD Student.
> http://xldb.di.fc.ul.pt/ncardoso
>
> www.tumba.pt - Search on the Portuguese Web!
> www.linguateca.pt - Distributed Resource Center for Portuguese  
> Language
> Processing
> ------------------------------------------------------------------------------
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion

