Allo Allo!

I've been looking at the DBpedia dumps instead of working on the raw
Wikipedia dump, and I have some questions about the consistency of the
DBpedia ontology data with the original Wikipedia dump.

See example:

1.

grep 'Terminator_2' mappingbased_properties_en.ttl

does not report the starring actors.

They are listed on the Wikipedia page, though:

http://en.wikipedia.org/wiki/Terminator_2:_Judgment_Day
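
A stricter version of the grep check can be sketched in Python. This is only
a minimal sketch: it assumes the .ttl file holds one "<subject> <predicate>
<object> ." triple per line, and that the starring property uses the URI
http://dbpedia.org/ontology/starring (an assumption to verify against the
dump). The sample lines are made up for illustration.

```python
import io

# Assumed predicate URI for the "starring" property; check the dump for the exact form.
STARRING = "<http://dbpedia.org/ontology/starring>"

def find_triples(lines, subject_fragment, predicate=None):
    """Yield (subject, predicate, object) for lines whose SUBJECT contains
    subject_fragment, optionally restricted to one predicate."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)  # subject, predicate, rest of line
        if len(parts) < 3:
            continue
        subj, pred, obj = parts
        if subject_fragment not in subj:
            continue
        if predicate is not None and pred != predicate:
            continue
        yield subj, pred, obj.rstrip(" .")  # drop the trailing " ."

# Made-up sample lines for illustration only, not copied from the real dump.
SAMPLE = """\
<http://dbpedia.org/resource/Terminator_2:_Judgment_Day> <http://dbpedia.org/ontology/runtime> "1"^^<http://www.w3.org/2001/XMLSchema#double> .
<http://dbpedia.org/resource/Terminator_2:_Judgment_Day> <http://dbpedia.org/ontology/starring> <http://dbpedia.org/resource/Example_Actor> .
<http://dbpedia.org/resource/Other_Film> <http://dbpedia.org/ontology/starring> <http://dbpedia.org/resource/Terminator_2_Fan> .
"""

starring = [o for _, _, o in find_triples(io.StringIO(SAMPLE), "Terminator_2", STARRING)]
print(starring)
```

Against the real file, the same function could be fed an open file handle,
e.g. open("mappingbased_properties_en.ttl", encoding="utf-8"). Unlike plain
grep, it ignores lines where Terminator_2 only appears in the object.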


2.

Some titles seem to be truncated when comparing the files. The page
http://en.wikipedia.org/wiki/Islamic_State_of_Iraq_and_the_Levant

is missing from:
page_ids_en.ttl

while it is present in page_links_en.ttl and redirects_en.ttl.

I found that 'Islamic_State_of_Iraq_and_the_Levant
<http://en.wikipedia.org/wiki/Islamic_State_of_Iraq_and_the_Levant>' is
reported as 'Islamic_State_of_Iraq
<http://en.wikipedia.org/wiki/Islamic_State_of_Iraq_and_the_Levant>' in
page_ids_en.ttl, which leads to inconsistencies with the other files.

Is this a case of truncation, or some other error?
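
A check of this kind can be sketched in Python as follows. It is a minimal
sketch under assumptions: one triple per line, subjects written as
<http://dbpedia.org/resource/...>, and made-up sample lines (the page IDs
are invented). The helper names subjects and truncation_candidates are
hypothetical. It only lists candidates: a shorter title may also be a
legitimate separate page.

```python
import io

def subjects(lines):
    """Return the set of subject URIs (first token of each triple line)."""
    found = set()
    for raw in lines:
        line = raw.strip()
        if line and not line.startswith("#"):
            found.add(line.split(None, 1)[0])
    return found

def truncation_candidates(uri, subject_set):
    """Subjects whose title is a shorter prefix of uri, cut at an underscore."""
    full = uri.strip("<>")
    hits = []
    for s in subject_set:
        short = s.strip("<>")
        if short != full and full.startswith(short + "_"):
            hits.append(s)
    return sorted(hits)

# Made-up stand-in for page_ids_en.ttl; the numeric IDs are invented.
PAGE_IDS = """\
<http://dbpedia.org/resource/Islam> <http://dbpedia.org/ontology/wikiPageID> "1"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://dbpedia.org/resource/Islamic_State_of_Iraq> <http://dbpedia.org/ontology/wikiPageID> "2"^^<http://www.w3.org/2001/XMLSchema#integer> .
"""

target = "<http://dbpedia.org/resource/Islamic_State_of_Iraq_and_the_Levant>"
ids = subjects(io.StringIO(PAGE_IDS))
print(target in ids)                       # is the full title present?
print(truncation_candidates(target, ids))  # shorter titles that might be truncations
```

Running the same two calls over page_links_en.ttl and redirects_en.ttl would
show in which files the full title appears.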



3. Is there any way to synchronize a live DBpedia with Wikipedia
(dbpedia-live-mirror) without installing the whole Virtuoso DB? I would
just like to extract (via Python?) the data one may need from the original
dump (e.g. to get the same .ttl files).



4. Finally, does DBpedia use an lxml parser for the wiki dump (the whole dump)?
Is it open source, so that it could be adjusted?
I tried to use it, but I failed because there are symbols like <C8> in the
middle of the <page> tag that are not closed: they are not valid XML tags
(I think they stand for some characters), so the iteration
fails and the results are not clean.
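
For reference, the kind of streaming iteration over <page> elements meant
here can be sketched as below. This is a minimal sketch, not DBpedia's
parser: it uses the standard-library xml.etree.ElementTree rather than lxml,
and the namespace URI and sample document are assumptions. One possibility
(a guess, not verified against the dump) is that <C8> is how a pager or
terminal displays a raw byte 0xC8 belonging to a multi-byte UTF-8 character,
in which case reading the file as bytes should parse cleanly.

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(source):
    """Yield (title, text) for each <page>, without loading the whole dump."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = text = None
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text
        yield title, text
        elem.clear()  # free the finished page's children to keep memory low

# Made-up miniature dump; the title starts with U+0218, whose UTF-8
# encoding begins with the byte 0xC8.
SAMPLE = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    '<page><title>\u0218tefan</title>'
    '<revision><text>hello</text></revision></page>'
    '</mediawiki>'
).encode("utf-8")

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

For a real dump, the source would be a file opened in binary mode, so that
the XML parser handles the decoding itself.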

Thank you so much; I hope these notes are of some help too.

Luigi
_______________________________________________
Dbpedia-gsoc mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc