Nuno,
On 2 Apr 2009, at 12:44, Nuno Cardoso wrote:
> Can you please explain to me the encoding procedures that DBpedia uses
> for the datasets?
We use UTF-8 everywhere.
> It seems that everything was encoded from MacRoman, which is not a
> good thing for people working on non-Mac machines (I develop on a
> Mac, but the production environment will be Linux, which does not have
> the slightest idea what the MacRoman encoding is).
There is no MacRoman in DBpedia. Your text editor saves your source
files in UTF-8, but your compiler/interpreter interprets them in
MacRoman. That gives rise to the funny effects you are seeing.
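To make the effect concrete, here is a minimal sketch in Python (chosen only for brevity, it is not what you are running; the variable names are mine). It assumes the file on disk is UTF-8 and that the reader wrongly assumes MacRoman:

```python
# Sketch: UTF-8 bytes misread as MacRoman produce the "funny" characters.
text = "José Saramago"

utf8_bytes = text.encode("utf-8")         # what the editor writes to disk
misread = utf8_bytes.decode("mac_roman")  # what a MacRoman-assuming reader sees
reread = utf8_bytes.decode("utf-8")       # what a UTF-8-aware reader sees

print(len(text), len(utf8_bytes))  # 13 characters, 14 bytes: 'é' takes two
print(misread)                     # the two bytes of 'é' show up as two characters
print(reread)                      # the original text, intact
```

Nothing is lost in the misreading: encoding the garbled string back to MacRoman yields the original UTF-8 bytes, which is why the data itself is fine and only the interpretation is wrong.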
<snip>
> So, I must force MacRoman to properly encode entities to DBpedia...
> SELECT * WHERE {<http://dbpedia.org/resource/Jos%C3%A9_Saramago> ?p ?o}
That's UTF-8, not MacRoman. MacRoman would be Jos%8E_Saramago.
MacRoman is a single-byte encoding, thus the fact that the single
character 'é' has been encoded as two octets '%C3%A9' should already
tell you that you're not looking at MacRoman. The fact that US-ASCII
characters remain unencoded while other characters are multi-byte
encoded is a very strong clue that you're looking at UTF-8.
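You can check this yourself. The following is a small sketch, assuming Python's standard `urllib.parse.quote` and its built-in `mac_roman` codec; it percent-encodes the same name under both encodings:

```python
from urllib.parse import quote

name = "José_Saramago"

as_utf8 = quote(name, encoding="utf-8")
as_macroman = quote(name, encoding="mac_roman")

print(as_utf8)      # Jos%C3%A9_Saramago
print(as_macroman)  # Jos%8E_Saramago

# Number of octets each encoding uses for the single character 'é'.
print(len("é".encode("utf-8")), len("é".encode("mac_roman")))  # 2 1
```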
> I'm sure that you've thought about it, but why MacRoman? Can't the
> datasets be in UTF-8? Encoding is essential for GET parameters, but
> unencoded URLs work perfectly well in Wikipedia, for instance. I mean, the
> en.wikipedia.org/wiki/José_Saramago
> <http://en.wikipedia.org/wiki/Jos%C3%A9_Saramago> page
> works fine.
I hope you realize how ironic your comments are!
We use exactly the same encoding as Wikipedia:
http://en.wikipedia.org/wiki/Jos%C3%A9_Saramago
There is no such thing as an unencoded URI. You simply cannot have a
character like é in a URI. When you enter
http://en.wikipedia.org/wiki/José_Saramago
into your browser, then your browser automatically encodes the URI
(using UTF-8 followed by percent-encoding) before sending it to the
server. That's why it *appears* like unencoded URIs work in your
browser. In reality, they wouldn't even be valid URIs.
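Roughly what the browser does for you can be sketched as follows. This is a simplification under my own assumptions: `to_uri` is a hypothetical helper, it only touches the path, and real browsers also handle host names and query strings, which are ignored here.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def to_uri(typed: str) -> str:
    """Percent-encode the path of a typed address as UTF-8 (hypothetical helper)."""
    parts = urlsplit(typed)
    # Keep '/' and already-present '%' escapes untouched; encode everything
    # else that is not allowed in a URI path.
    encoded_path = quote(parts.path, safe="/%", encoding="utf-8")
    return urlunsplit(parts._replace(path=encoded_path))

typed = "http://en.wikipedia.org/wiki/José_Saramago"
print(typed.isascii())  # False: not a valid URI as typed
print(to_uri(typed))    # http://en.wikipedia.org/wiki/Jos%C3%A9_Saramago
```

Because '%' is kept as a safe character, feeding an already encoded URI through `to_uri` leaves it unchanged.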
> And the properties are even more difficult to parse. For instance, the
> "population" word in Portuguese is população.
> The property is represented in datasets
> as popula_percent_E3_percent_A7_percent_E3_percent_A3o.
> That's most likely a replaceAll("%","_percent_"), but why is it
> needed?
It is needed because RDF property names that contain the percent
character cannot be serialised in RDF/XML.
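The underlying restriction is that RDF/XML writes each property as an XML element name, and '%' is not a legal character in XML names. Here is a sketch of the substitution you guessed at, with a check against an XML parser. Both helper names and the example.org namespace are made up for illustration; this is not DBpedia's actual extraction code.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

def to_property_name(label: str) -> str:
    """Hypothetical helper: percent-encode as UTF-8, then replace '%' by '_percent_'."""
    return quote(label, encoding="utf-8").replace("%", "_percent_")

def parses_as_element(local_name: str) -> bool:
    """Check whether an XML parser accepts local_name as an element name."""
    doc = f'<p:{local_name} xmlns:p="http://example.org/property/"/>'
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(parses_as_element(quote("José")))             # False: '%' is rejected
print(parses_as_element(to_property_name("José")))  # True
```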
In general, to get from URIs to human-readable strings, don't mess
around with the URI, but find the rdfs:label property of the resource.
This advice holds for most RDF data. In DBpedia, it works for all
instances (in the /resource/ namespace), and it works for all
properties from the English-language infobox dataset. It does
currently *not* work for properties in other languages, because their
labels are not loaded into the DBpedia RDF store. But they are
available for download. The Portuguese one is here:
http://downloads.dbpedia.org/3.2/pt/infoboxproperties_pt.nt.bz2
Unfortunately, the labels in the dump are badly broken:
<http://dbpedia.org/property/popula_percent_E3_percent_A7_percent_E3_percent_A3o>
<http://www.w3.org/2000/01/rdf-schema#label>
"popula_percent_ e3_percent_ a7_percent_ e3_percent_ a3o" .
The literal should read "população". It's a bug.
(It would probably be a good idea for the DBpedia admins to load those
dumps into the store after the bug has been fixed.)
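For reading labels out of an N-Triples dump like this one, something along the following lines should do. It is only a sketch: the regular expression covers plain and language-tagged literals on a single line, it leaves N-Triples escape sequences undecoded, and the sample triples use a made-up example.org namespace rather than real DBpedia data.

```python
import re

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Matches one N-Triples line of the form: <subject> <predicate> "literal"[@lang] .
TRIPLE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+"((?:[^"\\]|\\.)*)"(?:@[A-Za-z-]+)?\s*\.\s*$'
)

def labels(lines):
    """Map each subject URI to its rdfs:label literal (escapes left as-is)."""
    found = {}
    for line in lines:
        match = TRIPLE.match(line)
        if match and match.group(2) == RDFS_LABEL:
            found[match.group(1)] = match.group(3)
    return found

sample = [
    '<http://example.org/property/population> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "population"@en .',
    '<http://example.org/property/population> '
    '<http://example.org/other> "ignored" .',
]
print(labels(sample))
```

To run it over the compressed dump, the lines can be read with `bz2.open(path, "rt", encoding="utf-8")` from the standard library, where `path` is wherever you saved the file.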
> Do you have any plans on future dataset releases to see if this
> "charset
> hell" is a little less painful?
We already use UTF-8 everywhere. We cannot fix "charset hell" with new
dataset releases. "Charset hell" exists because most developers don't
care about Unicode or character encodings, even though they should.
Internationalization is hard, unfortunately.
Best,
Richard
>
>
> Cheers,
>
> Nuno Cardoso
>
> === SCRIPT ===
> import java.net.*
>
> def x = 'José Saramago'       // plain text to encode
> def a = 'Jos%C3%A9_Saramago'  // percent-encoded text to decode
> def hr = "------"
>
> def encoding = 'ISO-8859-1'
> println "Encoding $x in $encoding: "+java.net.URLEncoder.encode(x, encoding)
> println "Decoding $a in $encoding: "+java.net.URLDecoder.decode(a, encoding)
> println hr
>
> encoding = 'UTF-8'
> println "Encoding $x in $encoding: "+java.net.URLEncoder.encode(x, encoding)
> println "Decoding $a in $encoding: "+java.net.URLDecoder.decode(a, encoding)
> println hr
>
> encoding = 'MacRoman'
> println "Encoding $x in $encoding: "+java.net.URLEncoder.encode(x, encoding)
> println "Decoding $a in $encoding: "+java.net.URLDecoder.decode(a, encoding)
> println hr
>
> // The one-argument forms use the platform default encoding.
> println "default encoding: "+System.getProperty("file.encoding")
> println "Encoding $x in default: "+java.net.URLEncoder.encode(x)
> println "Decoding $a in default: "+java.net.URLDecoder.decode(a)
>
> =========
> Nuno Cardoso, PhD Student.
> http://xldb.di.fc.ul.pt/ncardoso
>
> www.tumba.pt - Search on the Portuguese Web!
> www.linguateca.pt - Distributed Resource Center for Portuguese
> Language
> Processing
> ------------------------------------------------------------------------------
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion