Hello Richard,
Yup, I realize that the problem/confusion is more on my side. Anyway, thanks
for clearing things up.
I've worked on it a little more, and I have two comments:

  - The best approach is, in fact, as you said, to use the rdfs:label
property of the resource. Whatever encoding hell is happening in the URIs,
this should keep it out of the SPARQL queries.
  - The properties indeed have a bug like you noticed. I'm sort of wrapping
properties into objects with my own labels to tackle this problem. Are you
planning a release soon?
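
For reference, a minimal sketch of the label lookup described in the first point. It is written in Python purely for illustration, and it only builds the query text; nothing is sent to an endpoint. The function name and the language filter are assumptions made for this example, not something taken from DBpedia's documentation.

```python
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"


def build_label_query(resource_uri: str, lang: str = "en") -> str:
    """Return a SPARQL query asking for the rdfs:label of one resource.

    The URI is used exactly as given (already percent-encoded), so no
    character-set guessing happens on the client side.
    The language filter is an assumption for this sketch.
    """
    return (
        f"PREFIX rdfs: <{RDFS_NS}>\n"
        "SELECT ?label WHERE {\n"
        f"  <{resource_uri}> rdfs:label ?label .\n"
        f'  FILTER (lang(?label) = "{lang}")\n'
        "}"
    )


if __name__ == "__main__":
    print(build_label_query("http://dbpedia.org/resource/Jos%C3%A9_Saramago"))
```

The readable name then comes from the returned literal instead of from decoding the URI by hand.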

Cheers,


Nuno Cardoso, PhD Student.
http://xldb.di.fc.ul.pt/ncardoso

www.tumba.pt - Search on the Portuguese Web!
www.linguateca.pt - Distributed Resource Center for Portuguese Language
Processing





On Fri, Apr 3, 2009 at 16:23, Richard Cyganiak <[email protected]> wrote:

> Nuno,
>
> On 2 Apr 2009, at 12:44, Nuno Cardoso wrote:
>
>> Can you please explain to me the encoding procedures that DBpedia uses for
>> the datasets?
>>
>
> We use UTF-8 everywhere.
>
>> It seems that everything was encoded from MacRoman, which is not a
>> good thing for people working on non-Mac machines (I develop on a Mac, but
>> the production environment will be a Linux machine, which does not have
>> the slightest idea what the MacRoman encoding is).
>>
>
> There is no MacRoman in DBpedia. Your text editor saves your source files
> in UTF-8, but your compiler/interpreter interprets them in MacRoman. That
> gives rise to the funny effects you are seeing.
>
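The effect described here can be reproduced in a few lines. This is only a sketch in Python (not the Groovy from the original script), and both function names are made up for the example: UTF-8 bytes are read back with the wrong charset, then the damage is undone.

```python
def misread_utf8_as_macroman(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as MacRoman (the wrong charset)."""
    return text.encode("utf-8").decode("mac_roman")


def repair_macroman_mojibake(garbled: str) -> str:
    """Undo the misreading: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode("mac_roman").decode("utf-8")


if __name__ == "__main__":
    garbled = misread_utf8_as_macroman("José Saramago")
    print(garbled)                            # the é appears as two unrelated characters
    print(repair_macroman_mojibake(garbled))  # the original string again
```

The data is the same in both cases; only the charset used to interpret the bytes differs.
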
> <snip>
>
>> So, I must force MacRoman to properly encode entities to DBpedia...
>> SELECT * WHERE {<http://dbpedia.org/resource/Jos%C3%A9_Saramago> ?p ?o}
>>
>
> That's UTF-8, not MacRoman. MacRoman would be Jos%8E_Saramago. MacRoman is
> a single-byte encoding, thus the fact that the single character 'é' has been
> encoded as two octets '%C3%A9' should already tell you that you're not
> looking at MacRoman. The fact that US-ASCII characters remain unencoded
> while other characters are multi-byte encoded is a very strong clue that
> you're looking at UTF-8.
>
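The byte counts mentioned above are easy to check. A small sketch, assuming Python's standard `urllib.parse.quote` as a stand-in for the `URLEncoder` calls in the original script (the helper name is hypothetical):

```python
from urllib.parse import quote, unquote

NAME = "José_Saramago"


def percent_encode(text: str, charset: str) -> str:
    """Percent-encode text after converting it to bytes in the given charset."""
    return quote(text, safe="", encoding=charset)


if __name__ == "__main__":
    # One escaped octet for the single-byte charsets, two for UTF-8.
    for charset in ("utf-8", "mac_roman", "iso-8859-1"):
        print(f"{charset:>10}: {percent_encode(NAME, charset)}")
    # Decoding only gives back the original text if the same charset is used.
    print(unquote("Jos%C3%A9_Saramago", encoding="utf-8"))
```
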
>> I'm sure that you've thought about it, but why MacRoman? Can't the
>> datasets be in UTF-8? Encoding is essential for GET parameters, but
>> unencoded URLs work perfectly well in Wikipedia, for instance. I mean, the
>> en.wikipedia.org/wiki/José_Saramago page works fine.
>>
>
> I hope you realize how ironic your comments are!
>
> We use exactly the same encoding as Wikipedia:
> http://en.wikipedia.org/wiki/Jos%C3%A9_Saramago
>
> There is no such thing as an unencoded URI. You simply cannot have a
> character like é in a URI. When you enter
> http://en.wikipedia.org/wiki/José_Saramago into
> your browser, then your browser automatically encodes the URI (using UTF-8
> followed by percent-encoding) before sending it to the server. That's why it *appears* like
> unencoded URIs work in your browser. In reality, they wouldn't even be valid
> URIs.
>
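Roughly what the browser does can be sketched as follows. This is a simplification in Python that only touches the path component (host names and query strings follow other rules and are passed through unchanged here), and `iri_to_uri` is a name invented for the example.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Percent-encode the non-ASCII characters of the path as UTF-8 octets.

    '%' is kept as a safe character so that a path that is already
    encoded is not encoded a second time.
    """
    parts = urlsplit(iri)
    path = quote(parts.path, safe="/%", encoding="utf-8")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(iri_to_uri("http://en.wikipedia.org/wiki/José_Saramago"))
    print(iri_to_uri("http://en.wikipedia.org/wiki/Jos%C3%A9_Saramago"))
```

Both calls yield the same string, which is the form the server actually receives.
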
>> And the properties are even more difficult to parse. For instance, the
>> word for "population" in Portuguese is população.
>> The property is represented in the datasets
>> as popula_percent_E3_percent_A7_percent_E3_percent_A3o.
>> That's most likely a replaceAll("%","_percent_"), but why is it needed?
>>
>
> It is needed because RDF property names that contain the percent character
> cannot be serialised in RDF/XML.
>
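The background is that RDF/XML writes each property as an XML element name, and '%' is not a legal character in XML names. Below is a sketch of the substitution guessed at in the question, with its inverse. It assumes UTF-8 percent-encoding followed by a plain string replace; the function names are hypothetical, and the output has C3 where the property name quoted above has E3, so treat it as an illustration of the substitution only, not as a reconstruction of the extraction code.

```python
from urllib.parse import quote, unquote

MARKER = "_percent_"


def to_property_name(label: str) -> str:
    """Percent-encode as UTF-8, then replace every '%' with the marker."""
    return quote(label, safe="", encoding="utf-8").replace("%", MARKER)


def from_property_name(name: str) -> str:
    """Reverse the substitution, then percent-decode as UTF-8."""
    return unquote(name.replace(MARKER, "%"), encoding="utf-8")


if __name__ == "__main__":
    encoded = to_property_name("população")
    print(encoded)
    print(from_property_name(encoded))
```
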
> In general, to get from URIs to human-readable strings, don't mess around
> with the URI, but find the rdfs:label property of the resource. This advice
> holds for most RDF data. In DBpedia, it works for all instances (in the
> /resource/ namespace), and it works for all properties from the
> English-language infobox dataset. It does currently *not* work for
> properties in other languages, because their labels are not loaded into the
> DBpedia RDF store. But they are available for download. The Portuguese one
> is here:
> http://downloads.dbpedia.org/3.2/pt/infoboxproperties_pt.nt.bz2
>
> Unfortunately, the labels in the dump are badly broken:
> <http://dbpedia.org/property/popula_percent_E3_percent_A7_percent_E3_percent_A3o>
> <http://www.w3.org/2000/01/rdf-schema#label> "popula_percent_ e3_percent_
> a7_percent_ e3_percent_ a3o" .
>
> The literal should read "população". It's a bug.
>
> (It would probably be a good idea for the DBpedia admins to load those
> dumps into the store after the bug has been fixed.)
>
>> Do you have any plans for future dataset releases, to see if this "charset
>> hell" can be made a little less painful?
>>
>
> We already use UTF-8 everywhere. We cannot fix "charset hell" with new
> dataset releases. "Charset hell" exists because most developers don't care
> about Unicode or character encodings, even though they should.
> Internationalization is hard, unfortunately.
>
> Best,
> Richard
>
>
>
>
>>
>> Cheers,
>>
>> Nuno Cardoso
>>
>> === SCRIPT ===
>> import java.net.*
>>
>> def x = 'José Saramago'
>> def a = 'Jos%C3%A9_Saramago'
>> def hr = "------"
>>
>> def encoding = 'ISO-8859-1'
>> println "Encoding $x in $encoding: "+java.net.URLEncoder.encode(x,
>> encoding)
>> println "Decoding $a in $encoding: "+java.net.URLDecoder.decode(a,
>> encoding)
>> println hr
>>
>> encoding = 'UTF-8'
>> println "Encoding $x in $encoding: "+java.net.URLEncoder.encode(x,
>> encoding)
>> println "Decoding $a in $encoding: "+java.net.URLDecoder.decode(a,
>> encoding)
>> println hr
>>
>> encoding = 'MacRoman'
>> println "Encoding $x in $encoding: "+java.net.URLEncoder.encode(x,
>> encoding)
>> println "Decoding $a in $encoding: "+java.net.URLDecoder.decode(a,
>> encoding)
>> println hr
>>
>> println "default encoding: "+System.getProperty("file.encoding")
>> println "Encoding $x in default: "+java.net.URLEncoder.encode(x)
>> println "Decoding $a in default: "+java.net.URLDecoder.decode(a)
>>
>> =========
>> Nuno Cardoso, PhD Student.
>> http://xldb.di.fc.ul.pt/ncardoso
>>
>> www.tumba.pt - Search on the Portuguese Web!
>> www.linguateca.pt - Distributed Resource Center for Portuguese Language
>> Processing
>>
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> Dbpedia-discussion mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>
>
>