Hi DBpedia people,

I've been using the latest (3.7) Portuguese dumps to build a custom 
DBpedia index for Stanbol's embedded Solr 
(http://incubator.apache.org/stanbol/).

I'm not sure whether this is the right place to report this, but while 
working with the generated index I've found a number of cases where the 
Portuguese abstract/comment is garbled, for example:

http://dbpedia.org/page/Livonian_Brothers_of_the_Sword
http://dbpedia.org/page/Kingdom_of_Poland_%281916%E2%80%931918%29

Tracking these back to Wikipedia, I can see that in these cases the 
article uses no infobox or wiki markup. Instead, it contains a lot of 
raw HTML that renders as something similar.

However, when extracted, this leaves the comments/abstracts full of HTML 
markup :(
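
A rough way to estimate how many abstracts are affected is to scan them 
for HTML tags. The following is only a minimal sketch, assuming the 
abstracts have already been loaded as (resource URI, abstract text) 
pairs. The function names and the list of tags are hypothetical choices 
for illustration and are not part of DBpedia or Stanbol.

```python
import re

# Matches an opening or closing tag for a handful of HTML elements that
# tend to show up in hand-written layouts. The tag list is an assumption
# chosen for illustration; extend it as needed.
HTML_TAG = re.compile(
    r"</?(?:table|tr|td|th|div|span|br|font|small|center)\b[^>]*>",
    re.IGNORECASE,
)


def has_html_markup(abstract):
    """Return True if the abstract contains at least one of the listed tags."""
    return HTML_TAG.search(abstract) is not None


def find_suspect_abstracts(pairs):
    """Given (resource_uri, abstract) pairs, return URIs whose abstract has HTML."""
    return [uri for uri, abstract in pairs if has_html_markup(abstract)]


if __name__ == "__main__":
    # Made-up sample data; real input would come from the abstracts dump.
    sample = [
        ("http://example.org/resource/A", "A plain abstract without markup."),
        ("http://example.org/resource/B", '<table border="1"><tr><td>x</td></tr></table>'),
    ]
    for uri in find_suspect_abstracts(sample):
        print(uri)
```

This only flags literal tags in the text, so it would miss markup that 
is entity-escaped in the dump, and it could produce false positives for 
abstracts that legitimately quote HTML.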

I hope this helps identify weak spots in the Portuguese 
DBpedia/Wikipedia, or that someone can at least forward it to whoever 
looks after this (Pablo Mendes?). At the moment I don't know whether 
there are more cases like this or how many articles are affected. I'll 
keep testing and report back.

Best,
Alex

_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
