Hello,
Could you please explain the encoding procedure that DBpedia uses for its
datasets? It seems that everything was encoded from MacRoman, which is not a
good thing for people working on non-Mac machines (I develop on a Mac, but
the production environment will be a Linux machine, which does not have
the slightest idea what the MacRoman encoding is).
Take for instance the DBpedia resource for the writer 'José Saramago',
represented as <dbpedia.org/resource/Jos%C3%A9_Saramago> in the datasets. I
normally work in UTF-8, and I had to write a little script (at the end of
this mail) to figure out the encoding used, which gives:
Encoding José Saramago in ISO-8859-1: Jos%3F%A9+Saramago
Decoding Jos%C3%A9_Saramago in ISO-8859-1: José Saramago
------
Encoding José Saramago in UTF-8: Jos%E2%88%9A%C2%A9+Saramago
Decoding Jos%C3%A9_Saramago in UTF-8: José Saramago
------
Encoding José Saramago in MacRoman: Jos%C3%A9+Saramago
Decoding Jos%C3%A9_Saramago in MacRoman: José Saramago
------
default encoding: MacRoman
Encoding José Saramago in default: Jos%C3%A9+Saramago
Decoding Jos%C3%A9_Saramago in default: José Saramago
So, I must force MacRoman to properly encode entities to DBpedia...
SELECT * WHERE {<http://dbpedia.org/resource/Jos%C3%A9_Saramago> ?p ?o}
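To cross-check the output above independently of how the script file itself is read, here is a minimal sketch in Java (the same java.net classes the Groovy script calls). The class EncodingCheck and its helpers encode and resourceUri are hypothetical names, not DBpedia code; the Charset overloads need Java 10 or later; and replacing spaces with underscores before encoding is only a guess at how the dataset form is built. The accented letter is written as a Unicode escape so the literal does not depend on the editor's or the compiler's charset.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingCheck {
    // The writer's name with e-acute written as an escape (U+00E9).
    static final String NAME = "Jos\u00e9 Saramago";

    // Percent-encode text using the bytes of the given charset.
    static String encode(String text, Charset charset) {
        return URLEncoder.encode(text, charset);
    }

    // Hypothetical helper: underscores for spaces, then UTF-8 percent-encoding.
    static String resourceUri(String name) {
        return "http://dbpedia.org/resource/"
                + URLEncoder.encode(name.replace(' ', '_'), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode(NAME, StandardCharsets.ISO_8859_1)); // Jos%E9+Saramago
        System.out.println(encode(NAME, StandardCharsets.UTF_8));      // Jos%C3%A9+Saramago
        System.out.println(resourceUri(NAME)); // http://dbpedia.org/resource/Jos%C3%A9_Saramago
        // Decoding the dataset form as UTF-8 gives the name back, with the underscore kept.
        System.out.println(
                URLDecoder.decode("Jos%C3%A9_Saramago", StandardCharsets.UTF_8).equals("Jos\u00e9_Saramago"));
    }
}
```

Because the literal uses an escape, the result is the same whatever file.encoding is, so comparing it with the script's output shows whether the platform default charset played a role.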
I'm sure that you've thought about it, but why MacRoman? Can't the datasets
be in UTF-8? Encoding is essential for GET parameters, but unencoded URLs
work perfectly well in Wikipedia, for instance. I mean, the
en.wikipedia.org/wiki/José_Saramago <http://en.wikipedia.org/wiki/Jos%C3%A9_Saramago> page
works fine.
And the properties are even more difficult to parse. For instance, the
word for "population" in Portuguese is população.
The property is represented in the datasets
as popula_percent_E3_percent_A7_percent_E3_percent_A3o.
That's most likely a replaceAll("%","_percent_"), but why is it needed?
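If the guess above is right, the mangling can be undone mechanically. Here is a minimal round-trip sketch in Java, assuming a property name is the UTF-8 percent-encoded label with every % replaced by _percent_. The class PropertyName and the methods mangle and unmangle are hypothetical names, and the Charset overloads need Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class PropertyName {
    // Assumed mangling: percent-encode as UTF-8, then replace every "%" with "_percent_".
    static String mangle(String label) {
        return URLEncoder.encode(label, StandardCharsets.UTF_8).replace("%", "_percent_");
    }

    // Reverse of the assumed mangling: restore the "%" signs, then percent-decode as UTF-8.
    static String unmangle(String property) {
        return URLDecoder.decode(property.replace("_percent_", "%"), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The Portuguese word for "population", with c-cedilla and a-tilde written as escapes.
        String label = "popula\u00e7\u00e3o";
        String mangled = mangle(label);
        System.out.println(mangled); // popula_percent_C3_percent_A7_percent_C3_percent_A3o
        System.out.println(unmangle(mangled).equals(label)); // true
    }
}
```

A label that itself contains the text _percent_ would not survive this round trip, so this is only a sketch of the idea.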
Do you have any plans for future dataset releases to make this "charset
hell" a little less painful?
Cheers,
Nuno Cardoso
=== SCRIPT ===
import java.net.*
// x is the plain name; a is the percent-encoded form found in the datasets
def x = 'José Saramago'
def a = 'Jos%C3%A9_Saramago'
def hr = "------"
// Encode x and decode a with each explicitly named charset
def encoding = 'ISO-8859-1'
println "Encoding $x in $encoding: "+java.net.URLEncoder.encode(x, encoding)
println "Decoding $a in $encoding: "+java.net.URLDecoder.decode(a, encoding)
println hr
encoding = 'UTF-8'
println "Encoding $x in $encoding: "+java.net.URLEncoder.encode(x, encoding)
println "Decoding $a in $encoding: "+java.net.URLDecoder.decode(a, encoding)
println hr
encoding = 'MacRoman'
println "Encoding $x in $encoding: "+java.net.URLEncoder.encode(x, encoding)
println "Decoding $a in $encoding: "+java.net.URLDecoder.decode(a, encoding)
println hr
// Repeat with the platform default charset (one-argument overloads)
println "default encoding: "+System.getProperty("file.encoding")
println "Encoding $x in default: "+java.net.URLEncoder.encode(x)
println "Decoding $a in default: "+java.net.URLDecoder.decode(a)
=========
Nuno Cardoso, PhD Student.
http://xldb.di.fc.ul.pt/ncardoso
www.tumba.pt - Search on the Portuguese Web!
www.linguateca.pt - Distributed Resource Center for Portuguese Language
Processing