Jens and all,

I'm moving this thread which started off-list to the DBpedia  
discussion list.

It's about weird characters in some infobox dumps, especially the  
Japanese infobox dump. I'll start with a summary of the problem.

The names of template variables in some infoboxes contain international  
characters. For example, a German-language infobox might have a  
template variable "größe" (meaning "size").

Our current approach to deal with this is to:

1. apply the standard Wikipedia %-encoding to the variable name in  
order to create a property URI in the <http://dbpedia.org/property/>  
namespace, e.g.

     <http://dbpedia.org/property/gr%C3%B6%C3%9Fe>

2. convert the %XX triplets into _percent_XX because property URIs  
containing the % character cannot be serialized as RDF/XML. The  
property URI now looks like this:

     <http://dbpedia.org/property/gr_percent_C3_percent_B6_percent_C3_percent_9Fe>
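
To make the two steps concrete, here is a minimal Python sketch. It is not the actual extraction code: urllib.parse.quote only stands in for the Wikipedia %-encoding, and the function names are made up for illustration.

```python
from urllib.parse import quote

PROPERTY_NS = "http://dbpedia.org/property/"


def percent_encoded_local_name(variable_name):
    # Step 1 (approximation): UTF-8 percent-encoding of the template
    # variable name. quote() is a stand-in for the Wikipedia encoding.
    return quote(variable_name, safe="")


def current_property_uri(variable_name):
    # Step 2: rewrite every "%" as "_percent_" so that the local name
    # no longer contains a character RDF/XML cannot serialize.
    local_name = percent_encoded_local_name(variable_name)
    return PROPERTY_NS + local_name.replace("%", "_percent_")


print(PROPERTY_NS + percent_encoded_local_name("größe"))
print(current_property_uri("größe"))
```

Each non-ASCII character is two bytes in UTF-8 here, and each byte costs 11 characters ("_percent_" plus two hex digits), which is why the names grow so quickly.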

The problem is that we end up with some very long and ugly property  
names! Certainly no one would want to use properties like this in a  
SPARQL query! They also mess up the Pubby HTML view.

This wasn't a big problem when we only did infobox extraction  
from the English Wikipedia, because it contains very few of those  
troublesome template names, but it's a huge problem now that we do  
infobox extraction for many languages.

On 15 Feb 2008, at 12:11, Jens Lehmann wrote:
> Richard Cyganiak wrote:
>> On 14 Feb 2008, at 13:39, Georgi Kobilarov wrote:
>>> as far as I remember, the problem is the serialization to RDF/XML.
>>> % is not allowed in predicate URIs.
>> Just wanted to confirm this. The idea is to get rid of the %  
>> character in property URIs, to make sure that the RDF is  
>> serializable as RDF/XML.
>> Properties with special characters were extremely rare in the  
>> English infobox dump at the time we introduced this, so we didn't  
>> consider the ugliness to be a problem. They are of course much more  
>> frequent in the infobox dumps for other languages.
>
> What would be a good way to get rid of them? Currently, they are  
> apparently replaced by _percent_, which is not nice in some  
> languages, so we should replace them with something which is  
> serialisable to RDF/XML.

I can think of three possible approaches:

1. Simply drop the troublesome triples. If a triple can't be  
serialized as RDF/XML, just ignore the triple. We considered this for  
the English infobox extraction, but obviously this is not a good  
solution for the international infobox dumps which have a lot of those  
triples.

2. Encode the % character as something shorter than "_percent_", e.g.  
just a simple dash. This is still ugly but at least not so long:

     <http://dbpedia.org/property/gr-C3-B6-C3-9Fe>

The characters we can use without problems are: letters, digits,  
underscore "_" and dash "-".

3. Use real international characters in the URI, e.g.

     <http://dbpedia.org/property/größe>

I'm not quite sure if this is possible. RDF supposedly supports IRIs  
(the new style of i18ned URIs that can contain Unicode letters), and  
XML can certainly use these characters in element names, so it  
*should* be possible. But this is somewhat uncharted territory.  
Someone would have to dig through the relevant specs to see what  
exactly is or is not allowed, and from prior experience I would expect  
a lot of trouble with tools in our toolchain that are not quite  
Unicode-ready.
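
As a rough way to compare options 2 and 3, the sketch below builds both forms for "größe" and checks each local name against the XML name rules. The assumptions are mine: the character ranges are transcribed from the Name production of XML 1.0 (Fifth Edition) minus the colon, parsers that follow earlier editions may differ, and the function names are hypothetical. It says nothing about how the tools in our toolchain behave.

```python
import re
from urllib.parse import quote

PROPERTY_NS = "http://dbpedia.org/property/"

# Characters allowed at the start of an XML name (colon left out),
# assumed from the XML 1.0 Fifth Edition Name production.
_START = (
    "A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# Characters allowed after the first one.
_REST = _START + "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"
_NCNAME = re.compile(f"[{_START}][{_REST}]*")


def is_xml_local_name(local_name):
    # True if the string could be used as an XML element local name.
    return _NCNAME.fullmatch(local_name) is not None


def dash_local_name(variable_name):
    # Option 2: percent-encode, then write each "%" as a dash.
    return quote(variable_name, safe="").replace("%", "-")


candidates = [
    ("percent-encoded", quote("größe", safe="")),
    ("option 2 (dash)", dash_local_name("größe")),
    ("option 3 (IRI)", "größe"),
]
for label, local_name in candidates:
    ok = is_xml_local_name(local_name)
    print(f"{label:16} {PROPERTY_NS + local_name}  XML name: {ok}")
```

One caveat with the dash form in this sketch: quote() leaves a literal "-" in a variable name untouched, so the encoding cannot always be reversed unambiguously.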

> If anyone can provide a fix and commit it to SVN, I could regenerate  
> the problematic data sets. If that would require regenerating all  
> data sets, then it may be better to defer this to the next release  
> in 2 months.

I guess if we want to go with option 2, then this could be done  
quickly. Option 3 would probably have to wait for the next release. We  
may also decide to do 2 now and look into 3 later.

What do you all think? What should we do?

Richard



>
>
> Jens


_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion