Hello Richard,

Yup, I realize that the problem/confusion is more on my side. Anyway, thanks for clearing things up. I've experimented a little more, and I have two comments:
- The best approach is, in fact, as you said, to use the rdfs:label property of the resource. Regardless of the encoding hell that's happening in URIs, this should avoid it in SPARQL queries.
- The properties indeed have a bug, as you noticed. I'm working around it by wrapping properties in objects that carry my own labels.

Are you planning a release soon?

Cheers,
Nuno Cardoso, PhD Student.
http://xldb.di.fc.ul.pt/ncardoso
www.tumba.pt - Search on the Portuguese Web!
www.linguateca.pt - Distributed Resource Center for Portuguese Language Processing

On Fri, Apr 3, 2009 at 16:23, Richard Cyganiak <[email protected]> wrote:

> Nuno,
>
> On 2 Apr 2009, at 12:44, Nuno Cardoso wrote:
>
>> Can you please explain to me the encoding procedures that DBpedia uses for the datasets?
>
> We use UTF-8 everywhere.
>
>> It seems that everything was encoded from MacRoman, which is not a good thing for people working on non-Mac machines (I develop on a Mac, but the production environment will be Linux, which does not have the slightest idea what the MacRoman encoding is).
>
> There is no MacRoman in DBpedia. Your text editor saves your source files in UTF-8, but your compiler/interpreter interprets them in MacRoman. That gives rise to the funny effects you are seeing.
>
> <snip>
>
>> So, I must force MacRoman to properly encode entities to DBpedia...
>> SELECT * WHERE {<http://dbpedia.org/resource/Jos%C3%A9_Saramago> ?p ?o}
>
> That's UTF-8, not MacRoman. MacRoman would be Jos%8E_Saramago. MacRoman is a single-byte encoding, thus the fact that the single character 'é' has been encoded as two octets '%C3%A9' should already tell you that you're not looking at MacRoman. The fact that US-ASCII characters remain unencoded while other characters are multi-byte encoded is a very strong clue that you're looking at UTF-8.
>
>> I'm sure that you've thought about it, but why MacRoman? Can't the datasets be in UTF-8?
>> Encoding is essential for GET parameters, but unencoded URLs work perfectly well in Wikipedia, for instance. I mean, the en.wikipedia.org/wiki/José_Saramago page works fine.
>
> I hope you realize how ironic your comments are!
>
> We use exactly the same encoding as Wikipedia:
> http://en.wikipedia.org/wiki/Jos%C3%A9_Saramago
>
> There is no such thing as an unencoded URI. You simply cannot have a character like é in a URI. When you enter http://en.wikipedia.org/wiki/José_Saramago into your browser, then your browser automatically encodes the URI (using UTF-8 followed by percent-encoding) before sending it to the server. That's why it *appears* like unencoded URIs work in your browser. In reality, they wouldn't even be valid URIs.
>
>> And the properties are even more difficult to parse. For instance, the word "population" in Portuguese is população. The property is represented in the datasets as popula_percent_E3_percent_A7_percent_E3_percent_A3o. That's most likely a replaceAll("%","_percent_"), but why is it needed?
>
> It is needed because RDF property names that contain the percent character cannot be serialised in RDF/XML.
>
> In general, to get from URIs to human-readable strings, don't mess around with the URI, but find the rdfs:label property of the resource. This advice holds for most RDF data. In DBpedia, it works for all instances (in the /resource/ namespace), and it works for all properties from the English-language infobox dataset. It does currently *not* work for properties in other languages, because their labels are not loaded into the DBpedia RDF store. But they are available for download.
> The Portuguese one is here:
> http://downloads.dbpedia.org/3.2/pt/infoboxproperties_pt.nt.bz2
>
> Unfortunately, the labels in the dump are badly broken:
>
> <http://dbpedia.org/property/popula_percent_E3_percent_A7_percent_E3_percent_A3o> <http://www.w3.org/2000/01/rdf-schema#label> "popula_percent_ e3_percent_ a7_percent_ e3_percent_ a3o" .
>
> The literal should read "população". It's a bug.
>
> (It would probably be a good idea for the DBpedia admins to load those dumps into the store after the bug has been fixed.)
>
>> Do you have any plans for future dataset releases to make this "charset hell" a little less painful?
>
> We already use UTF-8 everywhere. We cannot fix "charset hell" with new dataset releases. "Charset hell" exists because most developers don't care about Unicode or character encodings, even though they should. Internationalization is hard, unfortunately.
>
> Best,
> Richard
>
>> Cheers,
>>
>> Nuno Cardoso
>>
>> === SCRIPT ===
>> import java.net.*
>>
>> def x = 'José Saramago'       // plain text to encode
>> def a = 'Jos%C3%A9_Saramago'  // percent-encoded text to decode
>> def hr = "------"
>>
>> def encoding = 'ISO-8859-1'
>> println "Encoding $x in $encoding: "+java.net.URLEncoder.encode(x, encoding)
>> println "Decoding $a in $encoding: "+java.net.URLDecoder.decode(a, encoding)
>> println hr
>>
>> encoding = 'UTF-8'
>> println "Encoding $x in $encoding: "+java.net.URLEncoder.encode(x, encoding)
>> println "Decoding $a in $encoding: "+java.net.URLDecoder.decode(a, encoding)
>> println hr
>>
>> encoding = 'MacRoman'
>> println "Encoding $x in $encoding: "+java.net.URLEncoder.encode(x, encoding)
>> println "Decoding $a in $encoding: "+java.net.URLDecoder.decode(a, encoding)
>> println hr
>>
>> // The single-argument overloads use the platform default encoding.
>> println "default encoding: "+System.getProperty("file.encoding")
>> println "Encoding $x in default: "+java.net.URLEncoder.encode(x)
>> println "Decoding $a in default: "+java.net.URLDecoder.decode(a)
>>
>> =========
>> Nuno Cardoso, PhD Student.
>> http://xldb.di.fc.ul.pt/ncardoso
>> www.tumba.pt - Search on the Portuguese Web!
>> www.linguateca.pt - Distributed Resource Center for Portuguese Language Processing
>>
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> Dbpedia-discussion mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
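
To make the encoding comparison from the thread concrete, here is a minimal Java sketch of how the same name percent-encodes under different charsets. The class name PercentEncodingDemo and its two helper methods are made up for illustration; only java.net.URLEncoder, java.net.URLDecoder and java.nio.charset.Charset are real APIs (the Charset overloads assume Java 10 or later). The accented character is written as a Unicode escape so that the result does not depend on how the compiler reads the source file, which is the trap described above. The "x-MacRoman" charset is assumed to be optional in a given JRE, so it is guarded.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;

public class PercentEncodingDemo {
    // "Jos<e-acute>_Saramago"; the escape keeps this file pure ASCII, so the
    // compiler's source encoding cannot change the string.
    static final String NAME = "Jos\u00e9_Saramago";

    // Hypothetical helper: percent-encode text using the named charset.
    static String encode(String text, String charsetName) {
        return URLEncoder.encode(text, Charset.forName(charsetName));
    }

    // Hypothetical helper: percent-decode text using the named charset.
    static String decode(String text, String charsetName) {
        return URLDecoder.decode(text, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two octets for the accented character: the UTF-8 signature.
        System.out.println("UTF-8:      " + encode(NAME, "UTF-8"));      // Jos%C3%A9_Saramago
        // Single-byte charsets produce exactly one octet for it.
        System.out.println("ISO-8859-1: " + encode(NAME, "ISO-8859-1")); // Jos%E9_Saramago
        // Assumption: x-MacRoman is an extended charset and may be missing.
        if (Charset.isSupported("x-MacRoman")) {
            System.out.println("MacRoman:   " + encode(NAME, "x-MacRoman")); // Jos%8E_Saramago
        }
        // Decoding the DBpedia form with the right charset restores the name;
        // decoding it with the wrong one yields two garbage characters instead.
        System.out.println(decode("Jos%C3%A9_Saramago", "UTF-8").equals(NAME));      // true
        System.out.println(decode("Jos%C3%A9_Saramago", "ISO-8859-1").equals(NAME)); // false
    }
}
```

The two-octet %C3%A9 only ever appears in the UTF-8 case, which matches the reasoning in the thread: a single-byte encoding such as MacRoman or ISO-8859-1 cannot produce it.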
